Abstract

Given a high-dimensional data set we often wish to find the strongest relationships within 
it. A common strategy is to evaluate a measure of dependence on every variable pair and 
retain the highest-scoring pairs for follow-up. This strategy works well if the statistic used 
is equitable [1], i.e., if, for some measure of noise, it assigns similar scores to equally noisy 
relationships regardless of relationship type (e.g., linear, exponential, periodic). 

In this paper, we introduce and characterize a population measure of dependence called
MIC*. We show three ways that MIC* can be viewed: as the population value of MIC, a
highly equitable statistic from [2]; as a canonical “smoothing” of mutual information; and as
the supremum of an infinite sequence defined in terms of optimal one-dimensional partitions
of the marginals of the joint distribution. Based on this theory, we introduce an efficient
approach for computing MIC* from the density of a pair of random variables, and we define
a new consistent estimator MIC_e for MIC* that is efficiently computable. In contrast, there
is no known polynomial-time algorithm for computing the original equitable statistic MIC.
We show through simulations that MIC_e has better bias-variance properties than MIC.
We then introduce and prove the consistency of a second statistic, TIC_e, that is a trivial
side-product of the computation of MIC_e and whose goal is powerful independence testing
rather than equitability.

We show in simulations that MIC_e and TIC_e have good equitability and power against
independence respectively. The analyses here complement a more in-depth empirical evaluation
of several leading measures of dependence [3] that shows state-of-the-art performance
for MIC_e and TIC_e.


1 Introduction 

The growing dimensionality of today’s data sets has popularized the idea of hypothesis-generating 
science, whereby a data set is used not to test existing hypotheses but rather to help a researcher 
formulate new ones. A common approach among practitioners is to evaluate some statistic on 
many candidate variable pairs in a data set, sort the variable pairs from highest-scoring to 
lowest, and manually examine all the pairs above a threshold score [4, 5]. 

A popular class of statistics used for such analyses is measures of dependence, i.e., statistics
whose population value is 0 in cases of statistical independence and non-zero otherwise. Measures
of dependence are attractive because they guarantee that asymptotically no non-trivial
relationship will erroneously be declared trivial. In the setting of continuous-valued data, which
is our focus, there is a long line of fruitful research on such statistics including, e.g., [2, 6-15].

The utility of a measure of dependence $\varphi$ can be assessed in two ways. The first is power
against independence, i.e., the power of independence testing based on $\varphi$ to detect various types
of non-trivial relationships. This is an important goal for datasets that have very few non-trivial 
relationships, or only very weak relationships that are difficult to detect. Often, however, the 
number of relationships declared statistically significant by a measure of dependence greatly 
exceeds the number of relationships that can then be explored further. For example, biological 
datasets often contain many non-trivial relationships, but testing a preliminary finding for 
further corroboration may take extensive manual lab work, or a study on human or animal 
subjects. In this case, it is tempting to restrict follow-up to relationships with high values of
$\varphi$, but this can skew the direction of follow-up work: if $\varphi$ systematically assigns higher scores
to, say, linear relationships than to non-linear ones, relatively noisy linear relationships might 
crowd out strong non-linear relationships from the top-scoring set. 

Motivated by this problem, in a companion paper [1] we define a second way of assessing a 
measure of dependence called equitability. Informally, an equitable statistic is one that, for some 
measure of relationship strength, assigns similar scores to equally strong relationships regardless 
of relationship type. For instance, we may want our measure of dependence to also have the 
property that on noisy functional relationships it assigns similar scores to relationships with the 
same $R^2$, i.e., the squared Pearson correlation between the observed y-values and the x-values
passed through the underlying function in question [2]. Or, alternatively, we may want the
value of our statistic to tell us about the proportion of points coming from the deterministic 
component of a mixture containing part signal and part uniform noise [16]. Defining measures 
of dependence that achieve good equitability with respect to interesting measures of relationship 
strength is a new and challenging problem, with a number of different formalizations. (See, e.g., 
[1] and [16] cited above, as well as [17] along with associated technical comments [18] and [19].) 

In this paper, we introduce and theoretically characterize two new measures of dependence
that we empirically show to have good equitability with respect to $R^2$ and power against
independence, respectively. We begin by introducing a new population measure of dependence
called MIC*. Given a pair of jointly distributed random variables $(X, Y)$, MIC*(X, Y) is the
supremum, over all finite grids $G$ imposed on the support of $(X, Y)$, of the mutual information
of the discrete distribution induced by $(X, Y)$ on the cells of $G$, subject to a regularization based
on the resolution of $G$. We prove three results, each of which gives a different way that this
population quantity can be viewed.

1. MIC* is the population value of the maximal information coefficient (MIC), a statistic 
introduced in [2] that is highly equitable with respect to $R^2$ on a large class of noisy
functional relationships. Simple corollaries of this result simplify and strengthen many of 
the theoretical results proven in [2] about MIC. 

2. MIC* is a minimal “smoothing” of mutual information, in the sense that the regularization 
in the definition of MIC* renders it uniformly continuous as a function of random variables, 
and no smaller regularization achieves continuity. A corollary of this is that MIC* is 
uniformly continuous while mutual information is not continuous. 

3. MIC* is the supremum of an infinite sequence defined in terms of optimal partitions of the
marginal distributions of (X, Y ) rather than optimal (two-dimensional) grids imposed on 
the joint distribution. This characterization greatly simplifies the computation of MIC* 
and associated quantities. 

After proving these three results, we leverage them to introduce efficient algorithms both 
for approximately computing MIC* and for estimating it from finite samples. We first provide 
an efficient algorithm that in many cases allows for computation to arbitrary precision of the 
MIC* of a pair of random variables whose joint density is known. We then introduce a statistic,
called MIC_e, that we prove is a consistent estimator of MIC*. In contrast to the MIC statistic
from [2], for which no efficient algorithm is known and a heuristic algorithm is used in practice,
MIC_e is efficiently computable. It has a better runtime complexity than the heuristic algorithm
currently in use for computing the original MIC statistic, and is orders of magnitude faster in
practice.

With a consistent and fast estimator for MIC* in hand, we turn to empirical analysis of
its performance. Specifically, we show through simulation that MIC_e has better bias/variance
properties than the heuristic algorithm used in [2] for computing MIC, which has no theoretical
convergence guarantees. Our analysis also reveals that the main parameter of MIC_e can be
used to tune statistical performance toward either stronger or weaker relationships in general.
After studying the bias/variance properties of MIC_e, we then demonstrate via simulation that
it outperforms currently available methods in terms of equitability with respect to $R^2$. Notably,
we show this performance advantage both on the set of functional relationships analyzed in [2]
and on a large set of randomly chosen noisy functional relationships.

We choose in this paper to analyze equitability specifically with respect to $R^2$, rather than
some other notion of relationship strength, because $R^2$ on noisy functional relationships is
a simple measure with broad familiarity and intuitive interpretation among practitioners. Of
course, it is also important to develop measures of dependence that are equitable with respect to
notions of relationship strength besides $R^2$ or on families of relationships besides noisy functional
relationships; however, our focus here remains on the “simple” case of $R^2$ on noisy functional
relationships.

Importantly, we note that although there are methods for directly estimating the $R^2$ of a
noisy functional relationship via nonparametric regression (see, e.g., [20, 21]), those methods are
not applicable in the context of equitability because they are not measures of dependence. That
is, because non-parametric regression methods assume a functional form for the relationship in
question, they can give trivial scores to non-functional relationships, even in the large-sample
limit. A simple example of this is when a distribution is supported on a circle, such that the
regression function is constant. In contrast, a measure of dependence is guaranteed never to
make this “mistake”. A measure of dependence that is equitable with respect to $R^2$ can therefore
be viewed either as an “upgraded” measure of dependence that also comes with some of
the interpretability properties of non-parametric regression, or as an “upgraded” approximate
non-parametric regression method that also has the robustness properties of a measure of
dependence.
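
The circle example can be checked numerically. The following is a minimal Python sketch, assuming a deterministic sample of evenly spaced points on the unit circle and a crude binned-mean estimate of the regression function (both choices are ours, made only for illustration). It shows that the estimated regression function is approximately zero everywhere and the Pearson correlation vanishes, even though $X$ determines $Y$ up to sign.

```python
import numpy as np

# Deterministic sample of points on the unit circle (an assumption made for
# reproducibility; any dense sample from the circle behaves similarly).
theta = np.linspace(0.0, 2.0 * np.pi, 1000, endpoint=False)
x, y = np.cos(theta), np.sin(theta)

# Crude estimate of the regression function E[Y | X]: average y within
# ten equal-width bins of x.
edges = np.linspace(-1.0, 1.0, 11)
bin_index = np.clip(np.digitize(x, edges) - 1, 0, 9)
binned_means = np.array([y[bin_index == b].mean() for b in range(10)])

# The linear association is absent ...
pearson_xy = np.corrcoef(x, y)[0, 1]
# ... yet the variables are strongly dependent: y**2 = 1 - x**2 exactly.
pearson_squares = np.corrcoef(x**2, y**2)[0, 1]

print("largest |binned mean|:", np.abs(binned_means).max())
print("corr(x, y):", pearson_xy)
print("corr(x^2, y^2):", pearson_squares)
```

A regression-based score would therefore be close to zero here, while any measure of dependence must assign this relationship a non-zero population value.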

The main strength of MIC_e is equitability rather than power to reject a null hypothesis
of independence. In some settings, though, it may be important to have good power against
independence. We therefore introduce here a statistic closely related to MIC_e called the total
information coefficient TIC_e. We prove the consistency of testing for independence using TIC_e,
and show via simulations that it achieves excellent power in practice, performing comparably
to or better than current methods. Because TIC_e arises naturally as a side-product of the
computation of MIC_e, it is available “for free” once MIC_e has been computed. This leads us
to propose a data analysis strategy consisting of first using TIC_e to filter out non-significant
relationships, and then ranking the remaining ones using the simultaneously computed values
of MIC_e.

In addition to the companion paper [1], which focuses on the theory behind equitability, 
this paper is accompanied by a second companion work [3] that explores in detail the empirical 
performance of the methods introduced here. That paper shows, by comparing MIC_e and TIC_e
to several leading measures of dependence under many different sampling and noise models,
that the equitability of MIC_e on noisy functional relationships and the power of independence
testing using TIC_e are both state-of-the-art. It also shows that these methods can be computed
very fast in practice. 

Taken together, our results shed significant light on the theory behind the maximal
information coefficient, and suggest that TIC_e and MIC_e are a useful pair of methods for data
exploration. Specifically, they point to joint use of these two statistics to filter and then rank 
relationships as a fast and practical way to explore large data sets by measuring dependence 
both powerfully and equitably. 

2 Preliminaries 

We work extensively in this paper with grids and discrete distributions over their cells. Given
a grid $G$ and a point $(x, y)$, we define the function $\mathrm{row}_G(y)$ to be the row of $G$ containing $y$
and we define $\mathrm{col}_G(x)$ analogously. For a pair $(X, Y)$ of jointly distributed random variables,
we write $(X, Y)|_G$ to denote $(\mathrm{col}_G(X), \mathrm{row}_G(Y))$, and we use $I((X, Y)|_G)$ to denote the discrete
mutual information [22-24] between $\mathrm{col}_G(X)$ and $\mathrm{row}_G(Y)$. Given a finite sample $D$ from the
distribution of $(X, Y)$, we sometimes use $D$ to refer both to the set of points in the sample and
to a point chosen uniformly at random from $D$. In the latter case, it will then make
sense to talk about, e.g., $D|_G$ and $I(D|_G)$.

For natural numbers $k$ and $\ell$, we use $G(k, \ell)$ to denote the set of all $k$-by-$\ell$ grids (possibly with
empty rows/columns). A grid $G$ is an equipartition of $(X, Y)$ if all the rows of $(X, Y)|_G$ have the
same probability mass, and all the columns do as well. We also use the term equipartition in the
analogous way for one-dimensional partitions into just rows or columns. For a one-dimensional
partition $P$ into rows and a one-dimensional partition $Q$ into columns, we write $(P, Q)$ to refer
to the grid constructed from these two partitions. When a partition $P$ can be obtained from a
partition $P'$ by addition of separators alone, we write $P' \subset P$.

Finally, let us establish some notation for infinite matrices. We use $m^\infty$ to denote the space
of infinite matrices equipped with the supremum norm. Given a matrix $A \in m^\infty$, we often
examine only the $(k, \ell)$-th entries of $A$ for which $k\ell \le i$ for some $i$. Thus, for $i \in \mathbb{Z}^+$, we define
the projection $r_i : m^\infty \to m^\infty$ via

\[
r_i(A)_{k,\ell} =
\begin{cases}
A_{k,\ell} & k\ell \le i \\
0 & k\ell > i
\end{cases}
\]
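
As an illustration, the projection can be applied to a finite top-left block of a matrix. The sketch below is only a minimal Python rendering under that finiteness assumption; the function name `project` and the use of 1-based indices $k, \ell$ mapped onto a zero-based array are our own conventions.

```python
import numpy as np

def project(A, i):
    """Zero every entry A[k, l] (1-based k, l) of a finite block with k * l > i."""
    A = np.asarray(A, dtype=float)
    k = np.arange(1, A.shape[0] + 1)[:, None]   # row indices 1..K as a column
    l = np.arange(1, A.shape[1] + 1)[None, :]   # column indices 1..L as a row
    return np.where(k * l <= i, A, 0.0)

block = np.ones((3, 3))
projected = project(block, 3)
print(projected)
```

Only the entries with $k\ell \le 3$, namely $(1,1)$, $(1,2)$, $(1,3)$, $(2,1)$ and $(3,1)$, survive the projection in this small example.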


3 The population maximal information coefficient MIC* 

In this section, we define and characterize the population maximal information coefficient MIC*.
We begin by defining the population quantity MIC*(X, Y) for a pair of jointly distributed
random variables $(X, Y)$. We then show three different ways to characterize this population
quantity: first, as the large-sample limit of the statistic MIC from [2]; second, as a minimally
smoothed version of mutual information; and third, as the supremum of an infinite sequence
defined in terms of optimal one-dimensional partitions of the marginals of the joint distribution
of $(X, Y)$. We conclude the section by showing how the third characterization leads to an
efficient approach for computing MIC* from the density of $(X, Y)$.


3.1 Defining MIC* 

The population maximal information coefficient can be defined in several equivalent ways, as we
will see later. For now, we begin with the simplest definition.


Definition 3.1. Let $(X, Y)$ be jointly distributed random variables. The population maximal
information coefficient (MIC*) of $(X, Y)$ is defined by

\[
\mathrm{MIC}_*(X, Y) = \sup_G \frac{I((X, Y)|_G)}{\log \|G\|}
\]

where $\|G\|$ denotes the minimum of the number of rows of $G$ and the number of columns of $G$.

Given that $I(X, Y) = \sup_G I((X, Y)|_G)$ (see, e.g., Chapter 8 of [22]), this can be viewed as
a regularized version of mutual information that penalizes complicated grids and ensures that
the result falls between 0 and 1.
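
To make the quantity inside the supremum concrete, the following minimal Python sketch evaluates $I(D|_G) / \log \|G\|$ for one fixed grid applied to a finite sample. It is an illustration only: the function names, the representation of a grid by its interior cut points, and the use of base-2 logarithms are assumptions of the sketch, and no optimization over grids is attempted.

```python
import numpy as np

def grid_mutual_information(x, y, x_cuts, y_cuts):
    """Mutual information (in bits) of a sample discretized by a fixed grid."""
    cols = np.digitize(x, x_cuts)            # column index of each point
    rows = np.digitize(y, y_cuts)            # row index of each point
    counts = np.zeros((len(y_cuts) + 1, len(x_cuts) + 1))
    np.add.at(counts, (rows, cols), 1)       # cell counts
    p = counts / counts.sum()
    outer = p.sum(axis=1, keepdims=True) @ p.sum(axis=0, keepdims=True)
    mask = p > 0
    return float((p[mask] * np.log2(p[mask] / outer[mask])).sum())

def normalized_grid_score(x, y, x_cuts, y_cuts):
    """I(D|_G) / log ||G||, where ||G|| = min(#rows, #columns) must be >= 2."""
    size = min(len(x_cuts) + 1, len(y_cuts) + 1)
    if size < 2:
        raise ValueError("the grid needs at least two rows and two columns")
    return grid_mutual_information(x, y, x_cuts, y_cuts) / np.log2(size)

# A noiseless increasing relationship split at the median by a 2-by-2 grid.
x = np.arange(8.0)
y = x.copy()
score = normalized_grid_score(x, y, x_cuts=[3.5], y_cuts=[3.5])
print(score)
```

In this example the grid carries exactly one bit of information and $\log_2 \|G\| = 1$, so the normalized score attains its maximum value of 1.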

Before we continue, we state one simple equivalent definition of MIC* that is useful for the 
results in this section. This definition views MIC* as the supremum of a matrix called the 
population characteristic matrix, defined below. 

Definition 3.2. Let $(X, Y)$ be jointly distributed random variables. Let

\[
I^*((X, Y), k, \ell) = \max_{G \in G(k, \ell)} I((X, Y)|_G).
\]

The population characteristic matrix of $(X, Y)$, denoted by $M(X, Y)$, is defined by

\[
M(X, Y)_{k,\ell} = \frac{I^*((X, Y), k, \ell)}{\log \min\{k, \ell\}}
\]

for $k, \ell > 1$.

It is easy to see the following:

Proposition 1. Let $(X, Y)$ be jointly distributed random variables. We have

\[
\mathrm{MIC}_*(X, Y) = \sup M(X, Y)
\]

where $M(X, Y)$ is the population characteristic matrix of $(X, Y)$.

The population characteristic matrix is so named because just as MIC*, the supremum of this 
matrix, captures a sense of relationship strength, other properties of this matrix correspond to 
different properties of relationships. For instance, later in this paper we introduce an additional 
property of the characteristic matrix, the total information coefficient, that is useful for testing 
for the presence or absence of a relationship rather than quantifying relationship strength. 

3.2 First alternate characterization: MIC* is the population value of MIC


With MIC* defined, we now state our first alternate characterization of it, as the large-sample
limit of the statistic MIC introduced in [2]. We begin by reproducing a description of MIC
from [2], via the two definitions below.


Definition 3.3 ([2]). Let $D \subset \mathbb{R}^2$ be a set of ordered pairs. The sample characteristic matrix
$M(D)$ of $D$ is defined by

\[
M(D)_{k,\ell} = \frac{I^*(D, k, \ell)}{\log \min\{k, \ell\}}.
\]

Definition 3.4 ([2]). Let $D \subset \mathbb{R}^2$ be a set of $n$ ordered pairs, and let $B : \mathbb{Z}^+ \to \mathbb{Z}^+$. We define

\[
\mathrm{MIC}_B(D) = \max_{k\ell \le B(n)} M(D)_{k,\ell}
\]

where the function $B(n)$ is specified by the user. In [2], it was suggested that $B(n)$ be chosen
to be $n^\alpha$ for some constant $\alpha$ in the range of 0.5 to 0.8. (The statistics we introduce later will
have an analogous parameter. See Section 4.2.1.)

We have shown the following result about convergence of functions of the sample characteristic
matrix to their population counterparts, a consequence of which is the convergence of
MIC to MIC*. (In the theorem statement below, recall that $m^\infty$ is the space of infinite matrices
equipped with the supremum norm, and given a matrix $A$ the projection $r_i$ zeros out all the
entries $A_{k,\ell}$ for which $k\ell > i$.)

Theorem 1. Let $f : m^\infty \to \mathbb{R}$ be uniformly continuous, and assume that $f \circ r_i \to f$ pointwise.
Then for every random variable $(X, Y)$, we have

\[
(f \circ r_{B(n)})(M(D_n)) \to f(M(X, Y))
\]

in probability, where $D_n$ is a sample of size $n$ from the distribution of $(X, Y)$, provided
$\omega(1) < B(n) \le O(n^{1-\varepsilon})$ for some $\varepsilon > 0$.

Since the supremum of a matrix is uniformly continuous as a function on $m^\infty$ and can be
realized as the limit of maxima of larger and larger segments of the matrix, this theorem yields
our claim about MIC* as a corollary.

Corollary. $\mathrm{MIC}_B$ is a consistent estimator of MIC* provided $\omega(1) < B(n) \le O(n^{1-\varepsilon})$ for some
$\varepsilon > 0$.


We prove Theorem 1 in Appendix A and provide here some intuition for why it should hold 
as well as a description of the obstacles that must be overcome in the proof. 

To see why the theorem should hold, fix a random variable $(X, Y)$ and let $D$ be a sample of
size $n$ from its distribution. It is known that, for a fixed grid $G$, $I(D|_G)$ is a consistent estimator
of $I((X, Y)|_G)$ [9, 25]. We might therefore expect $I^*(D, k, \ell)$ to be a consistent estimator of
$I^*((X, Y), k, \ell)$ as well. And if $I^*(D, k, \ell)$ is a consistent estimator of $I^*((X, Y), k, \ell)$, then we
might expect the maximum of the sample characteristic matrix (which just consists of normalized
$I^*$ terms) to be a consistent estimator of the supremum of the true characteristic matrix.

These intuitions turn out to be true, but there are two reasons they are non-trivial to prove.
First, consistency for $I^*$ does not follow from abstract considerations since the maximum of
an infinite set of estimators is not necessarily a consistent estimator of the supremum of the
estimands (see footnote 1). Second, consistency of $I^*$ alone does not suffice to show that the maximum of the
sample characteristic matrix converges to MIC*. In particular, if $B(n)$ grows too quickly, and
the convergence of $I^*(D, k, \ell)$ to $I^*((X, Y), k, \ell)$ is slow, inflated values of MIC can result. To
see this, notice that if $B(n) = \infty$ then MIC = 1 always, even though each individual entry of
the sample characteristic matrix converges to its true value eventually.
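
The degenerate case of unbounded grid resolution can be seen in a few lines. The sketch below is a hedged Python illustration, assuming a sample whose $x$-values and $y$-values are all distinct (here two independent random permutations, chosen by us for convenience): an $n$-by-$n$ grid that isolates every point yields mutual information $\log n$, so the normalized score is 1 despite independence.

```python
import numpy as np

def mi_bits(counts):
    """Mutual information (in bits) of the distribution given by a count table."""
    p = counts / counts.sum()
    outer = p.sum(axis=1, keepdims=True) @ p.sum(axis=0, keepdims=True)
    mask = p > 0
    return float((p[mask] * np.log2(p[mask] / outer[mask])).sum())

rng = np.random.default_rng(0)
n = 50
x = rng.permutation(n)          # distinct x-values, independent of y
y = rng.permutation(n)          # distinct y-values

# An n-by-n grid with one separator between every pair of consecutive values
# places each sample point alone in its own row and its own column.
counts = np.zeros((n, n))
counts[y, x] = 1

score = mi_bits(counts) / np.log2(n)   # normalization by log min{n, n}
print(score)
```

This is why the growth of $B(n)$ must be controlled: with no restriction on grid size, the sample statistic saturates regardless of the underlying relationship.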

The technical heart of the proof is overcoming these obstacles by using the dependencies
between the quantities $I(D|_G)$ for different grids $G$ not only to show the consistency of $I^*(D, k, \ell)$
but also to quantify how quickly $I^*(D, k, \ell)$ actually converges to $I^*((X, Y), k, \ell)$.

3.3 Second alternate characterization: MIC* is a minimally smoothed mutual information

We now describe a second equivalent view of MIC*. Recall that for a pair of jointly distributed
random variables $(X, Y)$, we defined MIC*(X, Y) as

\[
\mathrm{MIC}_*(X, Y) = \sup_G \frac{I((X, Y)|_G)}{\log \|G\|}
\]

where $\|G\|$ denotes the minimum of the number of rows of $G$ and the number of columns of $G$.
As we discussed in Section 3.1, the mutual information $I(X, Y)$ is also a supremum, namely

\[
I(X, Y) = \sup_G I((X, Y)|_G),
\]

and so MIC* can be viewed as a regularized version of $I$. It is natural to ask whether the
regularization in the definition of MIC* has any smoothing effect on $I$. In this sub-section we
show first that it does, in the sense that MIC* is uniformly continuous as a function of random
variables with respect to the metric of statistical distance (see footnote 2), and second that the regularization by
$\log \|G\|$ is in fact the minimal one necessary for achieving any sort of continuity. As a corollary,
we obtain that $I$ by itself is not continuous as a function of random variables with respect to
the metric of statistical distance. This yields a view of MIC* as a canonical smoothing of $I$ that
yields continuity.
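
For distributions with probability mass functions, the statistical distance equals one-half of the $L^1$ distance between those functions. The short Python sketch below checks this identity by brute force on a three-point example; the function names and the example distributions are ours and serve only as an illustration.

```python
from itertools import combinations
import numpy as np

def statistical_distance(p, q):
    """One-half of the L1 distance between two probability mass functions."""
    return 0.5 * float(np.abs(np.asarray(p) - np.asarray(q)).sum())

def sup_over_events(p, q):
    """sup_T |P(A in T) - P(B in T)| by enumerating every non-empty event T."""
    p, q = np.asarray(p), np.asarray(q)
    best = 0.0  # the empty event contributes a difference of zero
    for size in range(1, len(p) + 1):
        for event in combinations(range(len(p)), size):
            idx = list(event)
            best = max(best, abs(float(p[idx].sum() - q[idx].sum())))
    return best

p = [0.5, 0.3, 0.2]
q = [0.2, 0.3, 0.5]
half_l1 = statistical_distance(p, q)
brute_force = sup_over_events(p, q)
print(half_l1, brute_force)
```

Both computations give 0.3 here, with the supremum attained on the event where the first distribution places more mass than the second.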

Formally, let $\mathcal{P}(\mathbb{R}^2)$ denote the space of random variables supported on $\mathbb{R}^2$ equipped with
the metric of statistical distance. Our first claim is that as a function defined on $\mathcal{P}(\mathbb{R}^2)$, MIC*
is uniformly continuous. We prove this claim by establishing a stronger result: the uniform
continuity of the characteristic matrix $M(X, Y)$. Specifically, by showing that the family of maps
corresponding to each individual entry of the characteristic matrix is uniformly equicontinuous,
we establish the following result.

Theorem 2. The map from $\mathcal{P}(\mathbb{R}^2)$ to $m^\infty$ defined by $(X, Y) \mapsto M(X, Y)$ is uniformly
continuous.

Footnote 1: If $\hat\theta_1, \ldots, \hat\theta_k$ is a finite set of estimators, then a union bound shows that the random variable
$(\hat\theta_1(D), \ldots, \hat\theta_k(D))$ converges in probability to $(\theta_1, \ldots, \theta_k)$ with respect to the supremum metric. The
continuous mapping theorem then gives the desired result. However, if the set of estimators is infinite, the union
bound cannot be employed. And indeed, if we let $\theta_i = 0$ for every $i$, and let $\hat\theta_i(D_n) = i/n$ deterministically, then
each $\hat\theta_i$ is a consistent estimator of $\theta_i$, but since the set $\{\hat\theta_1(D_n), \hat\theta_2(D_n), \ldots\} = \{1/n, 2/n, \ldots\}$ is unbounded,
$\sup_i \hat\theta_i(D_n) = \infty$ for every $n$.

Footnote 2: Recall that the statistical distance between random variables $A$ and $B$ is defined as
$\sup_T |P(A \in T) - P(B \in T)|$. When $A$ and $B$ have probability density functions or probability mass
functions, this equals one-half of the $L^1$ distance between those functions.

Proof. See Appendix B. □


Since the supremum is a continuous function on $m^\infty$, Theorem 2 yields the following corollary.

Corollary. The map $(X, Y) \mapsto \mathrm{MIC}_*(X, Y)$ is uniformly continuous.


Similar corollaries exist for any continuous function of the characteristic matrix. 

Interestingly, Theorem 2 relies crucially on the normalization in the definition of the characteristic
matrix. This is not a coincidence: as the following proposition shows, any normalization
that is meaningfully smaller than the one in the definition of the characteristic matrix will cause
the matrix to contain an infinite discontinuity as a function on $\mathcal{P}(\mathbb{R}^2)$.


Proposition 2. For some function $N(k, \ell)$, let $M^N$ be the characteristic matrix with
normalization $N$, i.e.,

\[
M^N(X, Y)_{k,\ell} = \frac{I^*((X, Y), k, \ell)}{N(k, \ell)}.
\]

If $N(k, \ell) = o(\log \min\{k, \ell\})$ along some infinite path in $\mathbb{N} \times \mathbb{N}$, then $M^N$ and $\sup M^N$ are not
continuous as functions of $\mathcal{P}([0,1] \times [0,1]) \subset \mathcal{P}(\mathbb{R}^2)$.

Proof. See Appendix C. □


The above proposition implies that the “smoothing” that MIC* applies to mutual information
is necessary in some sense. In particular, one corollary of the proposition is that mutual
information with no smoothing will contain a discontinuity.

Corollary. Mutual information is not continuous on $\mathcal{P}([0,1] \times [0,1]) \subset \mathcal{P}(\mathbb{R}^2)$.

Proof. Mutual information is the supremum of $M^N$ with $N = 1$. □

The same result can also be shown for the squared Linfoot correlation [26, 27], which equals
$1 - 2^{-2I}$ where $I$ represents mutual information. Thus, though the Linfoot correlation smooths
the mutual information enough to cause it to lie in the unit interval, it does not smooth the
mutual information sufficiently to cause it to be continuous.
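
As a quick sanity check on the formula $1 - 2^{-2I}$, the following Python sketch evaluates it for a bivariate normal distribution, whose mutual information is $-\tfrac{1}{2}\log_2(1 - \rho^2)$ bits. The function names are ours, and the Gaussian case is chosen only because its mutual information has a closed form; in that case the squared Linfoot correlation reduces to $\rho^2$.

```python
import numpy as np

def squared_linfoot(mi_bits):
    """Squared Linfoot correlation, 1 - 2^(-2 I), with I measured in bits."""
    return 1.0 - 2.0 ** (-2.0 * mi_bits)

def gaussian_mi_bits(rho):
    """Mutual information (bits) of a bivariate normal with correlation rho."""
    return -0.5 * np.log2(1.0 - rho ** 2)

rho = 0.6
value = squared_linfoot(gaussian_mi_bits(rho))
print(value, rho ** 2)
```

The map $I \mapsto 1 - 2^{-2I}$ sends $[0, \infty)$ monotonically onto $[0, 1)$, which is the sense in which it rescales mutual information into the unit interval.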

As we remarked previously, these results, when contrasted with the uniform continuity of 
MIC*, allow us to view the latter as a canonical “minimally smoothed” version of mutual information
that is uniformly continuous. This view gives a meaningful interpretation to the normalization
used in MIC*. Understanding MIC* as having smoothness properties not shared by mutual
information also suggests that estimators of MIC* may have better statistical properties than 
estimators of ordinary mutual information. This is consistent with the hardness-of-estimation 
result in [16] and is also borne out empirically in [3]. 


3.4 Third alternate characterization: MIC* is the supremum of the boundary of the characteristic matrix

We now show the third alternate view of MIC*: that it can be equivalently defined as the 
supremum over a boundary of the characteristic matrix rather than as a supremum over all of 
the entries of the matrix. This characterization of MIC* will serve as the foundation both for 
our approach to computing MIC*(X, Y) as well as the new estimator of MIC* that we introduce
later in this paper. 


We begin by defining what we mean by the boundary of the characteristic matrix. Our 
definition rests on the following observation. 

Proposition 3. Let $M$ be a population characteristic matrix. Then for $\ell \ge k$, $M_{k,\ell} \le M_{k,\ell+1}$.

Proof. Let $(X, Y)$ be the random variable in question. Since we can always let a row/column
be empty, we know that $I^*((X, Y), k, \ell) \le I^*((X, Y), k, \ell + 1)$. And since $\ell, \ell + 1 \ge k$, we know
that $M_{k,\ell} = I^*((X, Y), k, \ell) / \log k \le I^*((X, Y), k, \ell + 1) / \log k = M_{k,\ell+1}$. □

Since the entries of the characteristic matrix are bounded, the monotone convergence theorem
then gives the following corollary. In the corollary and henceforth, we let
$M_{k,\uparrow} = \lim_{\ell \to \infty} M_{k,\ell}$ and define $M_{\uparrow,\ell}$ similarly.

Corollary. Let $M$ be a population characteristic matrix. Then $M_{k,\uparrow}$ exists, is finite, and equals
$\sup_{\ell \ge k} M_{k,\ell}$. The same is true for $M_{\uparrow,\ell}$.

The above corollary allows us to define the boundary of the characteristic matrix.

Definition 3.5. Let $M$ be a population characteristic matrix. The boundary of $M$ is the set

\[
\partial M = \{M_{k,\uparrow} : 1 < k < \infty\} \cup \{M_{\uparrow,\ell} : 1 < \ell < \infty\}.
\]

The theorem below then gives a relationship between the boundary of the characteristic 
matrix and MIC*. 

Theorem 3. Let $(X, Y)$ be a random variable. We have

\[
\mathrm{MIC}_*(X, Y) = \sup \partial M(X, Y)
\]

where $M(X, Y)$ is the population characteristic matrix of $(X, Y)$.

Proof. The following argument shows that every entry of $M$ is at most $\sup \partial M$: fix a pair $(k, \ell)$
and notice that either $k \le \ell$, in which case $M_{k,\ell} \le M_{k,\uparrow}$, or $\ell \le k$, in which case $M_{k,\ell} \le M_{\uparrow,\ell}$.
Thus, $\mathrm{MIC}_* \le \sup \left( \{M_{\uparrow,\ell}\} \cup \{M_{k,\uparrow}\} \right) = \sup \partial M$.

On the other hand, Corollary 3.4 shows that each element of $\partial M$ is a supremum over some
elements of $M$. Therefore, $\sup \partial M$, being a supremum over suprema of elements of $M$, cannot
exceed $\sup M = \mathrm{MIC}_*$. □


3.5 Computing MIC* efficiently 

The importance of the characterization in Theorem 3 from the previous sub-section is
computational. Specifically, elements of the boundary of the characteristic matrix can be expressed in
terms of a maximization over (one-dimensional) partitions rather than (two-dimensional) grids, 
the former being much quicker to compute exactly. This is stated in the theorem below. 


Theorem 4. Let $M$ be a population characteristic matrix. Then $M_{k,\uparrow}$ equals

\[
\max_{P \in \mathcal{P}(k)} \frac{I(X, Y|_P)}{\log k}
\]

where $\mathcal{P}(k)$ denotes the set of all partitions of size at most $k$.

Proof. See Appendix D. □


To formally state how this will help us compute MIC*, we note that Theorems 3 and 4 above 
together give the following corollary. 

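Corollary. Let $(X, Y)$ be a random variable, and let $\mathcal{P}$ be the set of finite-size partitions. Then

\[
\mathrm{MIC}_*(X, Y) = \sup \left( \left\{ \frac{I(X, Y|_P)}{\log |P|} : P \in \mathcal{P} \right\}
\cup \left\{ \frac{I(X|_P, Y)}{\log |P|} : P \in \mathcal{P} \right\} \right)
\]

where $|P|$ is the number of bins in the partition $P$.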


The expressions in the above corollary involve maximization only over one-dimensional
partitions rather than two-dimensional grids. We can exploit this fact to give an algorithm for
computing elements of the boundary of the characteristic matrix to arbitrary precision. To do
so, we utilize as a subroutine a dynamic programming algorithm from [2] called OptimizeXAxis.
Before continuing, we therefore give a brief overview of that algorithm.


Overview of OptimizeXAxis algorithm from [2]. The OptimizeXAxis algorithm takes
as input a set $D$ of $n$ data points, a fixed partition into columns $Q$ of size $\ell$ (see footnote 3), a “master” partition
into rows $\Pi$, and a number $k$. The algorithm returns, for $2 \le i \le k$, the partition into rows
$P_i \subset \Pi$ that maximizes the mutual information of $D|_{(P_i, Q)}$ among all sub-partitions of $\Pi$ of size
at most $i$. The algorithm works by exploiting the fact that, conditioned on the location $y$ of the
top-most line of $P_i$, the optimization of the rest of $P_i$ can be formulated as a sub-problem that
depends only on the data points below $y$. The algorithm uses dynamic programming to store
and reuse solutions to these subproblems, resulting in a runtime of $O(|\Pi|^2 k \ell)$. If a black-box
algorithm is used to compute each required mutual information in time at most $T$, then the
runtime of the algorithm can be shown to be $O(Tk|\Pi|)$.
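
The dynamic-programming idea can be sketched as follows. This is a simplified Python illustration in the spirit of the description above, not the implementation from [2]: the function name `optimize_axis`, the count-matrix input (master rows by fixed columns), and the formulation as minimizing the conditional entropy $H(Q \mid P)$, which is additive over the rows of $P$, are all assumptions of the sketch. Its triple loop performs $O(|\Pi|^2 k)$ cost evaluations of $O(\ell)$ work each.

```python
import numpy as np

def optimize_axis(counts, k):
    """Best row partitions of sizes at most 2..k built from a master partition.

    counts[r, c] = number of points in master row r and fixed column c.
    Returns {i: (mutual information in bits, separator positions)}.
    """
    counts = np.asarray(counts, dtype=float)
    m, n = counts.shape[0], counts.sum()
    # prefix[t] holds the column counts of master rows 0..t-1.
    prefix = np.vstack([np.zeros(counts.shape[1]), np.cumsum(counts, axis=0)])

    def bin_cost(s, t):
        # Contribution to H(Q | P) of one merged row made of master rows s..t-1.
        c = prefix[t] - prefix[s]
        c = c[c > 0]
        return 0.0 if c.size == 0 else float(-(c * np.log2(c / c.sum())).sum() / n)

    col = counts.sum(axis=0)
    col = col[col > 0]
    h_q = float(-(col / n * np.log2(col / n)).sum())   # entropy of the columns

    top = min(k, m)
    # best[i, t]: minimal H(Q | P) cost of covering rows 0..t-1 with exactly i bins.
    best = np.full((top + 1, m + 1), np.inf)
    back = np.zeros((top + 1, m + 1), dtype=int)
    best[0, 0] = 0.0
    for i in range(1, top + 1):
        for t in range(i, m + 1):
            for s in range(i - 1, t):
                cand = best[i - 1, s] + bin_cost(s, t)
                if cand < best[i, t]:
                    best[i, t], back[i, t] = cand, s

    results = {}
    for i in range(2, k + 1):
        # "At most i" bins: pick the best exact bin count not exceeding i.
        j = min(range(1, min(i, m) + 1), key=lambda b: best[b, m])
        cuts, t = [], m
        for b in range(j, 0, -1):        # walk the back-pointers
            t = int(back[b, t])
            cuts.append(t)
        results[i] = (float(h_q - best[j, m]), sorted(cuts)[1:])  # drop leading 0
    return results

# Four master rows, two columns: the top two rows fall in column 0, the rest in column 1.
results = optimize_axis([[5, 0], [5, 0], [0, 5], [0, 5]], k=3)
print(results)
```

In the toy example a single separator after the second master row already recovers the full one bit of information, so allowing a third bin brings no improvement.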

The following theorem shows that the theory developed about the boundary of the characteristic
matrix, together with OptimizeXAxis, yields an efficient algorithm for computing
entries of the boundary to arbitrary precision.

Theorem 5. Given a random variable $(X, Y)$, $M_{k,\uparrow}$ (resp. $M_{\uparrow,\ell}$) is computable to within an
additive error of $O(k\varepsilon \log(1/(k\varepsilon))) + E$ (resp. $O(\ell\varepsilon \log(1/(\ell\varepsilon))) + E$) in time $O(kT(E)/\varepsilon)$ (resp.
$O(\ell T(E)/\varepsilon)$), where $T(E)$ is the time required to numerically compute the mutual information
of a continuous distribution to within an additive error of $E$.

Proof. See Appendix E. □ 

The algorithm proposed in Theorem 5 gives us a polynomial-time method for computing
any finite subset of the boundary $\partial M$ of the population characteristic matrix $M(X, Y)$ of a
random variable $(X, Y)$. Thus, if we have some $k_0, \ell_0$ such that the maximum of the finite
subset $\{M_{k,\uparrow}, M_{\uparrow,\ell} : k \le k_0, \ell \le \ell_0\}$ of $\partial M$ will be $\varepsilon$-close to the supremum of the entire
set $\partial M$, we can compute MIC*(X, Y) to within an error of $\varepsilon$. Though we usually do not have
precise knowledge of $k_0$ and $\ell_0$, for simple distributions it is often easy to make very conservative
educated guesses for them. This algorithm therefore allows us to approximate MIC*(X, Y) very
well in practice.

Footnote 3: Despite its name, the OptimizeXAxis algorithm can be used to optimize a partition of either axis. In our
description of the algorithm here, we choose to describe the algorithm as it would work for optimizing a partition
of the y-axis rather than the x-axis. This is for notational coherence of this paper only.

Being able to compute MIC*(X, Y) has two main advantages. The first is that it allows 
us to assess in simulations the large-sample properties of MIC* independent of any estimator. 
This is done in the companion paper [3], which shows that MIC* achieves high equitability with
respect to $R^2$ on a set of noisy functional relationships, thereby confirming that statistically
efficient estimation of MIC* is a worthwhile goal. 

The second advantage of being able to compute MIC*(X, Y) is that we can empirically assess 
the bias, variance, and expected squared error of estimators of MIC* by taking a distribution, 
computing MIC*, and then comparing the result to estimates of it based on finite samples. In 
the next section, we introduce a new estimator MIC e of MIC* and carry out such an analysis
to compare its statistical properties to those of the statistic MIC from [2].

4 Estimating MIC* with MIC e 

As we have shown, MIC* is actually the population value of the statistic MIC introduced in [2].
However, though consistent, the statistic MIC is not known to be efficiently computable, and
in [2] a heuristic approximation algorithm called APPROX-MIC was used instead. In this
section, we leverage the theory we have developed here to introduce a new estimator of MIC*
that is both consistent and efficiently computable. The new estimator, called MIC e, has
better runtime complexity even than the heuristic APPROX-MIC algorithm, and runs orders
of magnitude faster in practice.

The estimator MIC e is based on one of the alternate characterizations of MIC* proven in 
the previous section. Namely, if MIC* can be viewed as the supremum of the boundary of the 
characteristic matrix rather than of the entire matrix, then only the boundary of the matrix 
must be accurately estimated in order to estimate MIC*. This has the advantage that, whereas 
computing individual entries of the sample characteristic matrix involves finding optimal
(two-dimensional) grids, estimating entries of the boundary requires us only to find optimal
(one-dimensional) partitions. While the former problem is computationally difficult, the latter can be
solved using the dynamic programming algorithm from [2] that we also employed in Section 3.5 
to compute MIC* in the large-sample limit. 

We formalize this idea via a new object called the equicharacteristic matrix, which we denote
by [M]. The difference between [M] and the characteristic matrix M is as follows: while
the (k,ℓ)-th entry of M is computed from the maximal achievable mutual information using
any k-by-ℓ grid, the (k,ℓ)-th entry of [M] is computed from the maximal achievable mutual
information using any k-by-ℓ grid that equipartitions the dimension with more rows/columns.
(See Figure 1.) Despite this difference, as the equipartition in question gets finer and finer it
becomes indistinguishable from an optimal partition of the same size. This intuition can be
formalized to show that the boundary of [M] equals the boundary of M, and therefore that
sup [M] = sup M = MIC*. It will then follow that estimating [M] and taking the supremum,
as we did with M in the case of MIC, yields a consistent estimate of MIC*.

4.1 The equicharacteristic matrix 

We now define the equicharacteristic matrix and show that its supremum is indeed MIC*. To 
do so, we first define a version of I* that equipartitions the dimension with more rows/columns. 
[Figure 1 panels. Top: $M(X,Y)_{2,3}$ with I* = 0.918, and $[M](X,Y)_{2,3}$ with $I^{[*]}$ = 0.613. Bottom: $M(X,Y)_{2,9}$ with I* = 0.918, and $[M](X,Y)_{2,9}$ with $I^{[*]}$ = 0.918.]

Figure 1: A schematic illustrating the difference between the characteristic matrix M and the
equicharacteristic matrix [M]. (Top) When restricted to 2 rows and 3 columns, the characteristic
matrix M is computed from the optimal 2-by-3 grid. In contrast, the equicharacteristic
matrix [M] still optimizes the smaller partition of size 2 but is restricted to have the larger
partition be an equipartition of size 3. This results in a lower mutual information of 0.613.
(Bottom) When 9 columns are allowed instead of 3, the grid found by the characteristic
matrix does not change, since the grid with 3 columns was already optimal. However, now the
equicharacteristic matrix uses an equipartition of size 9 into columns, whose resolution is able
to fully capture the dependence between X and Y.


Note that in the definition, brackets are used to indicate the presence of an equipartition. 
Definition 4.1. Let (X,Y) be jointly distributed random variables. Define

$$I^*((X,Y),k,[\ell]) = \max_{G \in G(k,[\ell])} I((X,Y)|_G)$$

where $G(k,[\ell])$ is the set of k-by-ℓ grids whose y-axis partition is an equipartition of size ℓ.
Define $I^*((X,Y),[k],\ell)$ analogously.

Define $I^{[*]}((X,Y),k,\ell)$ to equal $I^*((X,Y),k,[\ell])$ if $k < \ell$ and $I^*((X,Y),[k],\ell)$ otherwise.
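To make the objects in Definition 4.1 concrete, the following Python sketch evaluates the mutual information induced on a finite sample by a single grid whose y-axis is equipartitioned. It is an illustration only: the helper names (`equipartition_labels`, `cut_labels`, `grid_mutual_information`) are hypothetical, ties are broken arbitrarily by sort order, and the maximization over x-axis partitions that the definition requires is not performed here.

```python
import bisect
import math
from collections import Counter


def equipartition_labels(values, size):
    """Assign each value a bin index in 0..size-1 so that bins hold (nearly) equal counts."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    labels = [0] * len(values)
    for rank, i in enumerate(order):
        labels[i] = rank * size // len(values)
    return labels


def cut_labels(values, cuts):
    """Bin index of each value for an arbitrary partition given by sorted cut points."""
    return [bisect.bisect_right(cuts, v) for v in values]


def grid_mutual_information(row_labels, col_labels):
    """Mutual information (in bits) of the discrete distribution a grid induces on a sample."""
    n = len(row_labels)
    joint = Counter(zip(row_labels, col_labels))
    rows = Counter(row_labels)
    cols = Counter(col_labels)
    return sum(
        (c / n) * math.log2(c * n / (rows[r] * cols[q]))
        for (r, q), c in joint.items()
    )


# One candidate grid: 2 freely chosen columns, 4 equipartitioned rows.
xs = list(range(20))
ys = [x * x for x in xs]
rows = equipartition_labels(ys, 4)   # the bracketed (equipartitioned) axis
cols = cut_labels(xs, [9.5])         # one candidate x-axis partition
info = grid_mutual_information(rows, cols)
```

In this toy sample y is a monotone function of x, so the two columns are fully determined by the rows and the grid carries one bit of information.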

We now define the equicharacteristic matrix in terms of $I^{[*]}$. In the definition below, we
continue our convention of using brackets around a quantity to denote the presence of equipartitions.

Definition 4.2. Let (X,Y) be jointly distributed random variables. The population equicharacteristic
matrix of (X, Y), denoted by [M](X,Y), is defined by

$$[M](X,Y)_{k,\ell} = \frac{I^{[*]}((X,Y),k,\ell)}{\log \min\{k,\ell\}}$$

for k, ℓ > 1.
The boundary of the equicharacteristic matrix can be defined via a limit in the same way
as that of the characteristic matrix. We then have the following theorem.

Theorem 6. Let (X, Y) be jointly distributed random variables. Then ∂[M] = ∂M.

Proof. See Appendix F. □

Since every entry of the equicharacteristic matrix is dominated by some entry on its boundary,
the equivalence of ∂[M] and ∂M yields the following corollary as a simple consequence.

Corollary. Let (X, Y) be jointly distributed random variables. Then sup [M](X, Y) = MIC*(X, Y).


4.2 The estimator MIC e 

With the equicharacteristic matrix defined, we can now define our new estimator MIC e in terms 
of the sample equicharacteristic matrix, analogously to the way we defined MIC in terms of the 
sample characteristic matrix. 


Definition 4.3. Let $D \subset \mathbb{R}^2$ be a set of ordered pairs. The sample equicharacteristic matrix
[M](D) of D is defined by

$$[M](D)_{k,\ell} = \frac{I^{[*]}(D,k,\ell)}{\log \min\{k,\ell\}}.$$

Definition 4.4. Let $D \subset \mathbb{R}^2$ be a set of n ordered pairs, and let $B : \mathbb{Z}^+ \to \mathbb{Z}^+$. We define

$$\mathrm{MIC}_{e,B}(D) = \max_{k\ell \le B(n)} [M](D)_{k,\ell}.$$
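As a small illustration of Definitions 4.3 and 4.4, the sketch below normalizes a table of already-computed values of $I^{[*]}(D,k,\ell)$ and takes the maximum over grid sizes with kℓ ≤ B(n). The function name and the dictionary layout are assumptions made for this example; computing the $I^{[*]}$ values themselves is the subject of Section 4.3. Values are taken to be in bits, so the normalization uses base-2 logarithms.

```python
import math


def mic_e_from_entries(istar, budget):
    """Maximum of I^[*](D, k, ell) / log2(min(k, ell)) over grid sizes with k * ell <= budget.

    `istar` is a hypothetical dict mapping (k, ell) to I^[*](D, k, ell) in bits.
    """
    best = 0.0
    for (k, ell), info in istar.items():
        if k >= 2 and ell >= 2 and k * ell <= budget:
            best = max(best, info / math.log2(min(k, ell)))
    return best


entries = {(2, 2): 0.5, (2, 3): 0.9, (3, 3): 1.2, (2, 30): 1.0}
estimate = mic_e_from_entries(entries, budget=15)
```

Note that the (3, 3) entry has the largest raw mutual information but not the largest normalized value, and the (2, 30) entry is excluded because it exceeds the budget.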


With the equivalence between the boundaries of the characteristic matrix and the equicharacteristic
matrix established, it is straightforward to show that MIC e is a consistent estimator
of MIC* via arguments similar to those we applied in the case of MIC. (See Appendix G.) 
Specifically, we show the following theorem, an analogue of Theorem 1. 


Theorem 7. Let $f : m^\infty \to \mathbb{R}$ be uniformly continuous, and assume that $f \circ r_i \to f$ pointwise.
Then for every random variable (X,Y), we have

$$(f \circ r_{B(n)})([M](D_n)) \to f([M](X,Y))$$

in probability, where $D_n$ is a sample of size n from the distribution of (X, Y), provided
$\omega(1) < B(n) \le O(n^{1-\varepsilon})$ for some ε > 0.

By setting f([M]) = sup [M], we then obtain as a corollary the consistency of MIC e.

Corollary. $\mathrm{MIC}_{e,B}$ is a consistent estimator of MIC* provided $\omega(1) < B(n) \le O(n^{1-\varepsilon})$ for
some ε > 0.
4.2.1 Choosing B(n)

As with the statistic MIC, the statistic MIC e requires the user to specify a function B(n) to
use. While the theory suggests that any function of the form $B(n) = n^\alpha$ suffices provided
0 < α < 1, different values of α may yield different finite-sample properties. We study the
empirical performance of MIC e for different choices of B(n) in Section 4.4.
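To give a feel for what a choice of α implies, the following sketch lists the grid sizes (k, ℓ) that satisfy kℓ ≤ B(n) for $B(n) = n^\alpha$. The function name `grid_sizes` and the default α = 0.6 are illustrative choices for this example, not recommendations.

```python
def grid_sizes(n, alpha=0.6):
    """All (k, ell) with k, ell >= 2 and k * ell <= B(n) = n ** alpha."""
    budget = n ** alpha
    return [
        (k, ell)
        for k in range(2, int(budget // 2) + 1)
        for ell in range(2, int(budget // k) + 1)
    ]


sizes_500 = grid_sizes(500)       # B(500) is roughly 41.6
sizes_100 = grid_sizes(100)       # B(100) is roughly 15.8
```

Larger α admits finer grids, which is the source of the finite-sample differences mentioned above.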

The companion paper [3] provides simple, empirical recommendations about appropriate values of α for different
settings. Those recommendations are formulated by choosing a set of representative relationships
(e.g., a set of noisy functional relationships), as well as a “ground truth” population
quantity Φ (e.g., R^2) that can be used to quantify the strength of each of those relationships,
and then assessing which values of α maximize the equitability of MIC e with respect to Φ at a
given sample size. This approach is applied to an analysis of real data from the World Health
Organization in [3], and the parameters chosen for that analysis are the ones used for all analyses
in this paper.

We remark that if the goal of the user is only detection of non-trivial relationships rather
than discovery of the strongest such relationships, α can also be chosen in a more straightforward
manner: the user can sub-sample a small random set of relationships on which to compare the
power of MIC e for different values of α. Those relationships can then be discarded and the rest
of the relationships analyzed with the optimal value of α. However, if the user’s primary goal
is power against independence, the statistic TIC e introduced in Section 5 of this paper should
be used with this strategy rather than MIC e.

4.3 Computing MIC e 

Both MIC and MIC e are consistent estimators of MIC*. The difference between them is that 
while MIC can currently be computed efficiently only via a heuristic approximation, MIC e can 
be computed exactly very efficiently via an approach similar to the one used for computing 
MIC* involving the OptimizeXAxis subroutine. We now describe the details of this approach. 

Recall that, given a fixed x-axis partition Q into ℓ columns, a set of n data points, a
“master” y-axis partition Π, and a number k, the OptimizeXAxis subroutine finds, for every
2 ≤ i ≤ k, a y-axis partition $P_i \subseteq \Pi$ of size at most i that maximizes the mutual information
induced by the grid $(P_i, Q)$. The algorithm does this in time $O(|\Pi|^2 k\ell)$. (For more discussion
of OptimizeXAxis, see Section 3.5, where it is also used to give an algorithm for computing
MIC*.)
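The following Python sketch shows one way a dynamic program of this kind can be organized. It is a simplified illustration in the spirit of OptimizeXAxis, not the authors' implementation. It relies on the identity I(P;Q) = H(Q) − H(Q|P), where H(Q|P) is a sum of per-row terms, so the best way to merge consecutive clumps of the master partition into rows can be found by a recursion over prefixes. The input layout (`clump_counts[t][q]` is the number of points in master clump t and column q) and the function name are assumptions made for this example.

```python
import math


def optimize_axis(clump_counts, k):
    """For each i in 1..k, the best mutual information (bits) using at most i rows.

    Rows are unions of consecutive clumps of the master partition; columns are fixed.
    """
    m = len(clump_counts)
    num_cols = len(clump_counts[0])
    n = sum(sum(row) for row in clump_counts)

    # prefix[t][q] = number of points in clumps 0..t-1 that fall in column q
    prefix = [[0] * num_cols]
    for row in clump_counts:
        prefix.append([p + c for p, c in zip(prefix[-1], row)])

    def score(s, t):
        """Contribution of a row made of clumps s..t-1: -(n_row / n) * H(Q | row)."""
        counts = [prefix[t][q] - prefix[s][q] for q in range(num_cols)]
        n_row = sum(counts)
        return sum((c / n) * math.log2(c / n_row) for c in counts if c > 0)

    # Entropy of the column marginal, which does not depend on the row partition.
    h_cols = -sum((c / n) * math.log2(c / n) for c in prefix[m] if c > 0)

    neg_inf = float("-inf")
    # best[i][t]: best total score when clumps 0..t-1 are split into exactly i rows
    best = [[neg_inf] * (m + 1) for _ in range(k + 1)]
    best[0][0] = 0.0
    for i in range(1, k + 1):
        for t in range(i, m + 1):
            best[i][t] = max(best[i - 1][s] + score(s, t) for s in range(i - 1, t))

    result, running = {}, neg_inf
    for i in range(1, k + 1):
        running = max(running, h_cols + best[i][m])  # "at most i" rows
        result[i] = running
    return result


# Four clumps, two columns: the lower two clumps lie in column 0, the upper two in column 1.
table = optimize_axis([[5, 0], [5, 0], [0, 5], [0, 5]], k=3)
```

Each of the O(k|Π|²) recursion steps evaluates one row score in O(ℓ) time, which matches the $O(|\Pi|^2 k\ell)$ bound quoted above; caching the scores would reduce this further.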

In the pair of theorems below, we show two ways that OptimizeXAxis can be used to
compute MIC e efficiently. In the proofs of both theorems, we neglect issues of divisibility, i.e.,
we often write B/2 rather than ⌊B/2⌋. This does not affect the results.

Theorem 8. There exists an algorithm EQUICHAR that, given a sample D of size n and some
$B \in \mathbb{Z}^+$, computes the portion $r_B([M](D))$ of the sample equicharacteristic matrix in time
$O(n^2 B^2)$, which equals $O(n^{4-2\varepsilon})$ for $B(n) = O(n^{1-\varepsilon})$ with ε > 0.

Proof. We describe the algorithm and simultaneously bound its runtime. We do so only for the
(k,ℓ)-th entries of [M](D) satisfying k ≤ ℓ, kℓ ≤ B. This suffices, since by symmetry computing
the rest of the required entries at most doubles the runtime.

To compute $[M](D)_{k,\ell}$ with k ≤ ℓ, we must fix an equipartition into ℓ columns on the x-axis
and then find the optimal partition of the y-axis of size at most k. If we set the master partition
Π of the OptimizeXAxis algorithm to be an equipartition of size n into rows, then it performs
precisely the required optimization. Moreover, for fixed ℓ it can carry out the optimization
simultaneously for all of the pairs {(2,ℓ), ..., (B/ℓ,ℓ)} in time $O(|\Pi|^2 (B/\ell)\ell) = O(n^2 B)$. For
fixed ℓ, this set contains all the pairs (k,ℓ) satisfying k ≤ ℓ, kℓ ≤ B. Therefore, to compute
all the required entries of [M](D) we need only apply this algorithm for each ℓ = 2, ..., B/2.
Doing so gives a runtime of $O(n^2 B^2)$. □
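The counting in the proof can be tallied explicitly. The lines below only restate the argument above (ignoring divisibility, as the proof does) together with the substitution behind the exponent 4 − 2ε stated in Theorem 8.

```latex
\begin{align*}
\text{cost for one fixed } \ell
  &= O\!\left(|\Pi|^2 \cdot \tfrac{B}{\ell} \cdot \ell\right) = O(n^2 B)
  \qquad \text{since } |\Pi| = n, \\
\text{cost over } \ell = 2, \ldots, B/2
  &= \tfrac{B}{2} \cdot O(n^2 B) = O(n^2 B^2), \\
B = O(n^{1-\varepsilon}) \;\Longrightarrow\; O(n^2 B^2)
  &= O\!\left(n^{2 + 2(1-\varepsilon)}\right) = O\!\left(n^{4 - 2\varepsilon}\right).
\end{align*}
```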

The algorithm above, while polynomial-time, is nonetheless not efficient enough for use in
practice. However, a simple modification solves this problem without affecting the consistency
of the resulting estimates. The modification hinges on the fact that OptimizeXAxis can use
master partitions Π besides the equipartition of size n that we used above. Specifically, setting Π
in the above algorithm to be an equipartition into ck “clumps”, where k is the size of the largest
optimal partition being sought, speeds up the computation significantly. This modification does
give a slightly different statistic. However, the result is still a consistent estimator of MIC*
because the size of the master partition Π grows as k grows, and so the optimal sub-partition of
Π approaches the true optimal partition eventually. These ideas, first about improved runtime
and second about preserved consistency, are formalized in the following theorem.

Theorem 9. Let (X, Y) be a pair of jointly distributed random variables, and let $D_n$ be a sample
of size n from the distribution of (X, Y). For every c > 1, there exists a matrix $\{M\}^c(D_n)$ such
that

1. The function

$$\mathrm{MIC}_{e,B}(\cdot) = \max_{k\ell \le B(n)} \{M\}^c(\cdot)_{k,\ell}$$

is a consistent estimator of MIC* provided $\omega(1) < B(n) \le O(n^{1-\varepsilon})$ for some ε > 0.

2. There exists an algorithm EquicharClump for computing $r_B(\{M\}^c(D_n))$ in time
$O(n + B^{5/2})$, which equals $O(n + n^{5(1-\varepsilon)/2})$ when $B(n) = O(n^{1-\varepsilon})$.

Proof. See Appendix H. □

For an analysis of the effect of the parameter c in the above theorem on the results of the 
EquicharClump algorithm, see Appendix H.3. 
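As a rough illustration of the clumping idea, the sketch below histograms a sample over a master partition of ck equal-size clumps on the y-axis, broken down by a fixed column assignment. The function name, argument names, and output layout are hypothetical, ties in y are broken by sort order, and the actual EquicharClump algorithm is specified in Appendix H rather than here.

```python
def clump_counts(ys, col_labels, num_cols, k, c=1):
    """Counts per (clump, column) for a master partition of c * k equal-size clumps.

    `col_labels[i]` is the column index of point i under the fixed x-axis partition.
    """
    num_clumps = min(c * k, len(ys))
    order = sorted(range(len(ys)), key=lambda i: ys[i])
    counts = [[0] * num_cols for _ in range(num_clumps)]
    for rank, i in enumerate(order):
        counts[rank * num_clumps // len(ys)][col_labels[i]] += 1
    return counts


ys = [0.1 * i for i in range(8)]
cols = [0, 0, 0, 0, 1, 1, 1, 1]
master = clump_counts(ys, cols, num_cols=2, k=2, c=2)
```

Because the master partition has ck rather than n cells, any subsequent optimization over its sub-partitions works with a table whose size depends on k and c, not on the sample size.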

Setting ε = 0.6 in the above theorem yields the following corollary.

Corollary. MIC* can be estimated consistently in linear time.

Of course, at low sample sizes, setting ε = 0.6 would be undesirable. However, our companion
paper [3] shows empirically that at large sample sizes this strategy works very well on typical
relationships.
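The arithmetic behind the linear-time claim is short. Substituting ε = 0.6 into the runtime bound of Theorem 9 gives:

```latex
\begin{align*}
B(n) &= n^{1-\varepsilon} = n^{0.4}, \\
B(n)^{5/2} &= n^{0.4 \cdot 5/2} = n, \\
O\!\left(n + B(n)^{5/2}\right) &= O(n + n) = O(n).
\end{align*}
```

More generally, the exponent 5(1 − ε)/2 is at most 1 exactly when ε ≥ 0.6, so ε = 0.6 is the smallest value for which this bound is linear.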

We remark that the EQUICHARCLUMP algorithm given above is asymptotically faster even
than the heuristic APPROX-MIC algorithm used to calculate MIC in practice, which runs in
time $O(B(n)^4)$. As demonstrated in our companion paper [3], this difference translates into a
substantial difference in runtimes for similar performance at a range of realistic sample sizes,
ranging from a 30-fold speedup at n = 500 to over a 350-fold speedup at n = 10,000.

For readability, in the rest of this paper we do not distinguish between the two versions of
MIC e computed by the EQUICHAR and EQUICHARCLUMP algorithms described above. Wherever
we present simulation results about MIC e, though, we use the version of the
statistic computed by EquicharClump.
Figure 2: Bias/variance characterization of APPROX-MIC and MIC e. Each plot shows expected
squared error, bias, or variance across the set of noisy functional relationships described in
Section 4.4 as a function of the R^2 of the relationships. The results are aggregated across
the 16 function types analyzed by either the average, median, or worst result at every value
of R^2. (a) A comparison between MIC e (light purple) and MIC (black) as computed via the
heuristic APPROX-MIC algorithm, at a typical usage parameter. (b) Performance of MIC e
with $B(n) = n^\alpha$ for various values of α.

4.4 Bias/variance characterization of MIC e

The algorithm we presented in Section 3.5 for computing MIC* in the large-sample limit allows 
us to examine the bias/variance properties of estimators of MIC*. Here, we use it to examine 
the bias and variance of both MIC as computed by the heuristic APPROX-MIC algorithm from 
[2], and MIC e as computed by the EquicharClump algorithm given above. To do this, we 
performed a simulation analysis on the following set of relationships 

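$$Q = \{(x + \varepsilon_\sigma,\ f(x) + \varepsilon'_\sigma) : x \in X_f,\ \varepsilon_\sigma, \varepsilon'_\sigma \sim \mathcal{N}(0, \sigma^2),\ f \in F,\ \sigma \in \mathbb{R}_{\ge 0}\}$$

where $\varepsilon_\sigma$ and $\varepsilon'_\sigma$ are i.i.d., F is the set of 16 functions analyzed in [2], and $X_f$ is the set of n
x-values that result in the points $(x_i, f(x_i))$ being equally spaced along the graph of f.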

For each relationship Z ∈ Q that we examined, we used the algorithm from Theorem 5 to
compute MIC*. We then simulated 500 independent samples from Z, each of size n = 500,
and computed both APPROX-MIC and MIC e on each one to obtain estimates of the sampling
distributions of the two statistics. From each of the two sampling distributions, we estimated
the bias and variance of the corresponding statistic on Z. We then analyzed the bias, variance, and expected
squared error of the two statistics as a function of relationship strength, which we quantified
using the coefficient of determination (R^2) with respect to the generating function.
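The summary step of such a simulation can be sketched as follows: given the values of an estimator on repeated samples and the population value it targets, compute the empirical bias, variance, and expected squared error. The function name is hypothetical, and the numbers in the example are made up purely to exercise the code; they are not results from this paper.

```python
from statistics import fmean, pvariance


def bias_variance_mse(estimates, true_value):
    """Empirical bias, variance, and mean squared error of an estimator."""
    mean_estimate = fmean(estimates)
    bias = mean_estimate - true_value
    variance = pvariance(estimates, mu=mean_estimate)
    mse = fmean((e - true_value) ** 2 for e in estimates)
    return bias, variance, mse


# Made-up estimates of a population value of 0.5 from four simulated samples.
bias, variance, mse = bias_variance_mse([0.5, 0.7, 0.6, 0.6], true_value=0.5)
```

With the population variance used here, the three quantities satisfy the usual decomposition: expected squared error equals squared bias plus variance.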

The results, presented in Figure 2, are interesting for two reasons. First, they demonstrate
that for a typical usage parameter of $B(n) = n^{0.6}$, MIC e performs substantially better than
APPROX-MIC overall. Specifically, the median of the expected squared error of MIC e across
the set F of functions is uniformly lower across R^2 values than that of APPROX-MIC. When
average expected squared error is used instead of median, MIC e still performs better on all
but the strongest of relationships (R^2 above ~0.9). The superior performance of MIC e is
consistent with the fact that we have theoretical guarantees about its statistical properties
whereas APPROX-MIC is a heuristic.

Second, the results show that different values of the exponent in $B(n) = n^\alpha$ give good
performance in different signal-to-noise regimes due to a bias-variance trade-off represented
by this parameter. Large values of α lead to increased expected error in lower-signal regimes
(low R^2) through both a positive bias in those regimes and a general increase in variance that
predominantly affects those regimes. On the other hand, small values of α lead to an increased
expected error in higher-signal regimes (high R^2) by leading to a negative bias in those regimes
and by shifting the variance of the estimator toward those regimes. In other words, lower values
of α are better suited for detecting weaker signals, while higher values of α are better suited for
distinguishing among stronger signals. This is consistent with the results seen in our companion
paper [3], which show that low values of α cause MIC e to yield better powered independence
tests while high values of α cause MIC e to have better equitability. For a detailed discussion of
this trade-off and of specific recommendations for how to set α in practice, see [3].

4.5 Equitability of MIC e 

As mentioned previously, one of the main motivations for the introduction of MIC was equitability,
the extent to which a measure of dependence usefully captures some notion of relationship
strength on some set of standard relationships. We therefore carried out an empirical analysis
of the equitability of MIC e with respect to R^2 and compared its performance to distance
correlation [10, 28], mutual information estimation [29], and maximal correlation estimation [8].

We began by assessing equitability on the set of relationships Q defined above, a set that has
been analyzed in previous work [2, 3, 17]. The results, shown in Figure 3, confirm the superior
equitability of the new estimator MIC e on this set of relationships.

To assess equitability more objectively without relying on a manually curated set of functions,
we then analyzed 160 random functions drawn from a Gaussian process distribution with
a radial basis function kernel with one of eight possible bandwidths in the set

{0.01, 0.025, 0.05, 0.1, 0.2, 0.25, 0.5, 1}

to represent a range of possible relationship complexities. The results in Figure 4 show
that MIC e outperforms currently existing methods in terms of equitability with respect to R^2
on these functions as well. Appendix Figure J1 shows a version of this analysis under a different
noise model that yields the same conclusion. We also examined the effect of outlier relationships 
on our results by repeatedly subsampling random subsets of 20 functions from this large set of 
relationships and measuring the equitability of each method on average over the subsets; results 
were similar. 
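For readers who want to generate comparable test functions, the sketch below draws one random function from a zero-mean Gaussian process with a radial basis function kernel, evaluated on a finite grid. It is a minimal pure-Python illustration under stated assumptions: the kernel is parameterized as exp(−(x − x')² / (2h²)) with bandwidth h, a small diagonal jitter is added for numerical stability, and the function name is hypothetical. The exact parameterization and sampling procedure used for the analysis above are not specified in this section.

```python
import math
import random


def sample_gp_function(xs, bandwidth, rng, jitter=1e-6):
    """Draw one function from a zero-mean GP with an RBF kernel, evaluated at xs."""
    n = len(xs)
    # RBF covariance matrix; the jitter keeps the Cholesky factorization stable.
    cov = [
        [
            math.exp(-((xs[i] - xs[j]) ** 2) / (2 * bandwidth ** 2))
            + (jitter if i == j else 0.0)
            for j in range(n)
        ]
        for i in range(n)
    ]
    # Cholesky factor: cov = chol * chol^T with chol lower triangular.
    chol = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = cov[i][j] - sum(chol[i][t] * chol[j][t] for t in range(j))
            chol[i][j] = math.sqrt(s) if i == j else s / chol[j][j]
    # Correlate independent standard normal draws with the Cholesky factor.
    z = [rng.gauss(0.0, 1.0) for _ in range(n)]
    return [sum(chol[i][t] * z[t] for t in range(i + 1)) for i in range(n)]


xs = [i / 19 for i in range(20)]
draw = sample_gp_function(xs, bandwidth=0.25, rng=random.Random(0))
```

Smaller bandwidths produce more rapidly varying functions, which is why the bandwidth serves as a proxy for relationship complexity.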

One feature of the performance of MIC e on these randomly chosen relationships that is 
demonstrated in Figure 4 is that it appears minimally sensitive to the bandwidth of the Gaussian 
process from which a given relationship is drawn. This puts it in contrast to, e.g., mutual 
information estimation, which shows a pronounced sensitivity to this parameter that prevents 
it from being highly equitable when relationships with different bandwidths are present in the 
same dataset. 

In our companion paper [3], we perform more in-depth analyses of the equitability with respect
to R^2 of MIC e, MIC, and the four measures of dependence described above as well as the
Hilbert-Schmidt independence criterion (HSIC) [11, 30], the Heller-Heller-Gorfine (HHG) test
[14], the data-derived partitions (DDP) test [31], and the randomized dependence coefficient
(RDC) [13]. These analyses consider a range of sample sizes, noise models, marginal distributions,
and parameter settings. They conclude that, in terms of equitability with respect to
R^2 on the sets of noisy functional relationships studied, a) MIC e uniformly outperforms MIC,
and b) MIC e outperforms all the methods tested in the vast majority of settings examined.
Appendix Figure I1 contains a reproduction of a representative equitability analysis from that
paper for the reader’s reference.

5 The total information coefficient 

So far we have presented results about estimators of the population maximal information coefficient,
a quantity for which equitability is the primary motivation. We now introduce and
analyze a new measure of dependence, the total information coefficient (TIC). In contrast to the 
maximal information coefficient, the total information coefficient is designed not for equitability 
but rather as a test statistic for testing a null hypothesis of independence. 

We begin by giving some intuition. Recall that the maximal information coefficient is the 
supremum of the characteristic matrix. While estimating the supremum of this matrix has many 
advantages, this estimation involves taking a maximum over many estimates of individual entries 
of the characteristic matrix. Since maxima of sets of random variables tend to become large as 
the number of variables grows, one can imagine that this procedure will lead to an undesirable 
positive bias in the case of statistical independence, when the population characteristic matrix
equals 0. This is detrimental for independence testing, when the sampling distribution of a
statistic under a null hypothesis of independence is crucial.

Figure 3: Equitability with respect to R^2 on a set of noisy functional relationships of (a)
the Pearson correlation coefficient, (b) a hypothetical measure of dependence φ with perfect
equitability, (c) distance correlation, (d) MIC e, (e) maximal correlation estimation, and (f)
mutual information estimation. For each relationship, a shaded region denotes 5th and 95th
percentile values of the sampling distribution of the statistic in question on that relationship
at every R^2. The resulting plot shows which values of R^2 correspond to a given value of each
statistic. The red interval on each plot indicates the widest range of R^2 values corresponding to
any one value of the statistic; the narrower the red interval, the higher the equitability. A red
interval with width 0, as in (b), means that the statistic reflects only R^2 with no dependence
on relationship type, as demonstrated by the pairs of thumbnails of relationships of different
types with identical R^2 values.

Figure 4: Equitability of methods examined on functions randomly drawn from a Gaussian
process distribution. Each method is assessed as in Figure 3, with a red interval indicating
the widest range of R^2 values corresponding to any one value of the statistic; the narrower the
red interval, the higher the equitability. Each shaded region corresponds to one relationship,
and the regions are colored by the bandwidth of the Gaussian process from which they were
sampled. Sample relationships for each bandwidth are shown in the top right with matching
colors.

The intuition behind the total information coefficient is that if we instead consider a more 
stable property, such as the sum of the entries in the characteristic matrix, we might expect to 
obtain a statistic with a smaller bias in the case of independence and therefore better power. 
Stated differently, if our only goal is to distinguish any dependence at all from complete noise, 
then disregarding all of the sample characteristic matrix except for its maximal value may throw 
away useful signal, and the total information coefficient avoids this by summing all the entries. 

We remark that in [2] it is suggested that other properties of the characteristic matrix may 
allow us to measure other aspects of a given relationship besides its strength, and several such 
properties were defined. The total information coefficient fits within this conceptual framework. 

In this section we define the total information coefficient in the case of both the characteristic 
matrix (TIC) and the equicharacteristic matrix (TIC e ). We then prove that both TIC and 
TIC e yield independence tests that are consistent against all dependent alternatives. Finally, 
we present a simulation study of the power of independence testing based on TIC e , showing 
that it outperforms other common measures of dependence. 

5.1 Definition and consistency of the total information coefficient 

We begin by defining the two versions of the total information coefficient. In the definition 
below, recall that M denotes a sample characteristic matrix whereas [M] denotes a sample 
equicharacteristic matrix. 

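Definition 5.1. Let $D \subset \mathbb{R}^2$ be a set of n ordered pairs, and let $B : \mathbb{Z}^+ \to \mathbb{Z}^+$. We define

$$\mathrm{TIC}_B(D) = \sum_{k\ell \le B(n)} M(D)_{k,\ell}$$

and

$$\mathrm{TIC}_{e,B}(D) = \sum_{k\ell \le B(n)} [M](D)_{k,\ell}.$$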

To show that these two statistics lead to consistent independence tests, we must take a step 
back and analyze the behavior of the analogous population quantities. We take some care with 
the limits involved in defining these quantities, since the fine-grained behavior of these limits 
will be a key part of our proof of consistency. 

Definition 5.2. For a matrix A and a positive number B, the B-partial sum of A, denoted by
$S_B(A)$, is

$$S_B(A) = \sum_{k\ell \le B} A_{k,\ell}.$$

When A is an (equi)characteristic matrix, $S_B(A)$ is the sum over all entries corresponding
to grids with at most B total cells. Thus, if M(D) is a sample characteristic matrix of a sample
D, $S_B(M(D)) = \mathrm{TIC}_B(D)$, and the same holds for [M](D) and $\mathrm{TIC}_{e,B}(D)$.
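The B-partial sum is simple to state in code. The sketch below assumes the (equi)characteristic matrix is stored as a dictionary keyed by grid size, which is a layout chosen for this example only; the entries shown are made-up numbers.

```python
def partial_sum(matrix, budget):
    """S_B(A): sum of entries A[(k, ell)] over grid sizes with k * ell <= budget."""
    return sum(
        value
        for (k, ell), value in matrix.items()
        if k >= 2 and ell >= 2 and k * ell <= budget
    )


# Made-up normalized entries of a sample (equi)characteristic matrix.
entries = {(2, 2): 0.5, (2, 3): 0.75, (3, 2): 0.25, (3, 3): 0.5, (4, 4): 1.0}
tic_like = partial_sum(entries, budget=9)
mic_like = max(v for (k, ell), v in entries.items() if k * ell <= 9)
```

The contrast between the last two lines is the point of this section: the sum uses every entry within the budget, whereas the maximum keeps only one of them.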

It is clear that if X and Y are statistically independent random variables, then both the
characteristic matrix M(X,Y) and the equicharacteristic matrix [M](X, Y) are identically 0,
so that $S_B(M(X,Y)) = S_B([M](X,Y)) = 0$ for all B. However, we are also interested in how
these quantities behave when X and Y are dependent. The following pair of propositions helps
us understand this. The first proposition shows a lower bound on the values of entries in both
M(X,Y) and [M](X,Y). The second proposition translates this into an asymptotic characterization
of how quickly $S_B(M)$ and $S_B([M])$ grow as functions of B. These two propositions
are the technical heart of why the total information coefficient yields a consistent independence
test.


Proposition 4. Let (X,Y) be a pair of jointly distributed random variables. If X and Y are
statistically independent, then M(X,Y) = [M](X,Y) = 0. If not, then there exists some a > 0
and some integer $\ell_0 \ge 2$ such that

$$M(X,Y)_{k,\ell},\ [M](X,Y)_{k,\ell} \ge \frac{a}{\log \min\{k,\ell\}}$$

either for all $k \ge \ell \ge \ell_0$, or for all $\ell \ge k \ge \ell_0$.

Proof. See Appendix K.1. □

Proposition 5. Let (X,Y) be a pair of jointly distributed random variables. If X and Y are
statistically independent, then $S_B(M(X,Y)) = S_B([M](X,Y)) = 0$ for all B > 0. If not, then
$S_B(M(X,Y))$ and $S_B([M](X,Y))$ are both Ω(B log log B).

Proof. See Appendix K.2. □
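One way to see where the B log log B growth rate comes from is the following heuristic calculation, which takes the bound of Proposition 4 in the case k ≥ ℓ ≥ ℓ_0 (the other case is symmetric) and uses the fact that all entries are non-negative. It is offered as a sketch only; the proof in Appendix K.2 may proceed differently.

```latex
\begin{align*}
S_B(M)
  &\ge \sum_{\ell = \ell_0}^{\lfloor \sqrt{B/2} \rfloor}\;
       \sum_{k = \ell}^{\lfloor B/\ell \rfloor} \frac{a}{\log \ell}
   \ge \sum_{\ell = \ell_0}^{\lfloor \sqrt{B/2} \rfloor}
       \frac{a}{\log \ell}\left(\frac{B}{\ell} - \ell\right) \\
  &\ge \frac{aB}{2} \sum_{\ell = \ell_0}^{\lfloor \sqrt{B/2} \rfloor} \frac{1}{\ell \log \ell}
   \qquad \text{since } \ell \le \sqrt{B/2} \text{ implies } \ell \le \frac{B}{2\ell}, \\
  &\ge c\,\frac{aB}{2} \left( \ln \ln \sqrt{B/2} - \ln \ln \ell_0 \right)
   = \Omega(B \log \log B),
\end{align*}
```

where the last line compares the sum with the integral of 1/(x ln x), which is decreasing for x ≥ 2, and c > 0 is a constant accounting for the base of the logarithm.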

The propositions above, together with reasoning analogous to the convergence arguments 
presented earlier, can be used to show the main result of this section, namely that the statistics 
TIC and TIC e yield consistent independence tests. 


Theorem 10. The statistics $\mathrm{TIC}_B$ and $\mathrm{TIC}_{e,B}$ yield consistent right-tailed tests of independence,
provided $\omega(1) < B(n) \le O(n^{1-\varepsilon})$ for some ε > 0.

Proof. See Appendix K.3. □ 

In practice, we often use the EquicharClump algorithm (see Section 4.3) to compute the 
equicharacteristic matrix from which we calculate TIC e . This algorithm does not compute 
the sample equicharacteristic matrix exactly. However, as in the case of MIC e , the use of the 
algorithm does not affect the consistency of our approach for independence testing. This is 
proven in Appendix H. 


5.2 Power of independence tests based on TIC e 

With the consistency of independence tests based on TIC and TIC e established, we turn now 
to empirical evaluation of the power of independence testing based on TIC e as computed using 
the EquicharClump algorithm. 

To evaluate the power of TIC e -based tests, we reproduced the analysis performed in [32]. 
Namely, for the set of functions F chosen by Simon and Tibshirani, we considered the set of 
relationships they analyzed: 

$$Q = \{(X,\ f(X) + \varepsilon') : X \sim \mathrm{Unif},\ f \in F,\ \varepsilon' \sim \mathcal{N}(0, \sigma^2),\ \sigma \in \mathbb{R}_{\ge 0}\}.$$

For each relationship Z in this set that we examined, we simulated a null hypothesis of
independence with the same marginal distributions, and generated 1,000 independent samples,
each with a sample size of n = 500, from both Z and from the null distribution. These were
used to estimate the power of the size-α right-tailed independence test based on each statistic
being evaluated. Following Simon and Tibshirani, we compared TIC e to the distance correlation
[10, 28], the original maximal information coefficient [2] as approximated using APPROX-MIC,
and to the Pearson correlation. (Though it is not a measure of dependence, the Pearson correlation
was presumably included by Simon and Tibshirani as an intuitive benchmark for what
is achievable under a linear model.) We also compared to MIC e using identical parameters to
those of TIC e to examine whether the summation performed by TIC e is better than maximization
when all other things are equal. Note that we do not compare to methods of analyzing
contingency tables, such as Pearson’s chi-squared test. This is because our data are real-valued
rather than discrete, and so contingency-based methods are not applicable. However, when data
are discrete, those methods can be very well powered.
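The power estimate described above can be sketched as follows. The example assumes that the rejection threshold of the right-tailed test is taken to be the empirical (1 − α) quantile of the statistic over the simulated null samples, which is a common convention but is an assumption here; the function name and the default size are likewise illustrative.

```python
import math


def estimate_power(null_stats, alt_stats, size=0.05):
    """Fraction of alternative-sample statistics exceeding the null (1 - size) quantile."""
    ordered = sorted(null_stats)
    # Index chosen so that at most a `size` fraction of null values lie strictly above it.
    idx = min(len(ordered) - 1, math.ceil((1 - size) * len(ordered)) - 1)
    threshold = ordered[idx]
    return sum(s > threshold for s in alt_stats) / len(alt_stats)


# Toy inputs: statistic values under the null and under one alternative.
null_values = list(range(100))
alt_values = [10, 50, 150, 200]
power = estimate_power(null_values, alt_values)
```

Applying the same function to the null values themselves returns the empirical rejection rate under independence, which should not exceed the nominal size.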

The results of our analysis are presented in Figure 5. First, the figure shows that TIC e 
compares quite favorably with distance correlation, a method considered to have state-of-the-art
power [32]. Specifically, TIC e uniformly outperforms distance correlation on 5 of the 8
relationship types examined, and performs comparably to it on the other three relationship
types. Distance correlation has many advantages over TIC e , including the fact that it easily 
generalizes to higher-dimensional relationships. However, in the bivariate setting TIC e appears 
to perform as well as distance correlation if not better in terms of statistical power against 
independence. 

The analysis also shows that TIC e outperforms the original maximal information coefficient
by a very large margin, and outperforms MIC e as well, supporting the intuition that the
summation performed by the former can indeed lead to substantial gains in power against
independence over the maximization performed by the latter. (We note that in both Simon and
Tibshirani’s analysis and in this one, the original maximal information coefficient was run with
default parameters that were optimized for equitability rather than power against independence.
When run with different parameters, its power improves substantially, though it still does not
match the power of MIC e. See Appendix Figure I2 and the discussion in [3].)

Our companion paper [3] expands on this analysis, conducting in-depth evaluation of
the power against independence of the tests described above as well as tests based on mutual
information estimation [29], maximal correlation estimation [8], HSIC [11, 30], HHG [14], DDP
[31], and RDC [13]. These analyses consider a range of sample sizes and parameter settings, as
well as a variety of ways of quantifying power across different alternative hypothesis relationship
types and noise levels. They conclude that in most settings TIC e either outperforms all
the methods tested or performs comparably to the best ones. Appendix Figure I2 contains a
reproduction of one detailed set of power curves from the main analysis in that paper for the
reader’s reference.

Figure 5: Comparison of power of independence testing based on TIC e (blue) to MIC with
default parameters (gray), MIC e with the same parameters as TIC e (black), distance correlation
(purple), and the Pearson correlation coefficient (green) across several alternative hypothesis
relationship types chosen by [32]. The relationships analyzed are described in Section 5.2.

6 Conclusion

As high-dimensional data sets become increasingly common, data exploration requires not only
statistics that can accurately detect a large number of non-trivial relationships in a data set, but
also ones that can identify a smaller number of strongest relationships. The former property is
achieved by measures of dependence that yield independence tests with high power; the latter
is achieved by measures of dependence that are equitable with respect to some measure of
relationship strength. In this paper, we introduced two related measures of dependence that
achieve these two goals, respectively, through the following theoretical contributions.

• A new population measure of dependence, MIC*, that we proved can be viewed in three 
different ways: as the population value of the maximal information coefficient (MIC) from 
[2], as a “minimal smoothing” of mutual information that makes it uniformly continuous, 
or as the supremum of an infinite sequence defined in terms of optimal partitions of one 
marginal at a time of a given joint distribution. 

• An efficient algorithm for approximating the MIC* of a given joint distribution. 

• A statistic MIC e that is a consistent estimator of MIC*, is efficiently computable, and has
good equitability with respect to R^2 both on a manually chosen set of noisy functional
relationships and on randomly chosen noisy functional relationships.

• The total information coefficient (TIC e ), a statistic that arises as a trivial side-product of 
the computation of MIC e and yields a consistent and powerful independence test. 

Though we presented here some empirical results for MIC*, MIC e , and TIC e , our focus 
was on theoretical considerations; the performance of these methods is analyzed in detail in our 
companion paper [3]. That paper shows that on a large set of noisy functional relationships with
varying noise and sampling properties, the asymptotic equitability with respect to R^2 of MIC*
is quite high and the equitability with respect to R^2 of MIC e is state-of-the-art. It also shows
that the power of the independence test based on TIC e is state-of-the-art across a wide variety 
of dependent alternative hypotheses. Finally, it demonstrates that the algorithms presented 
here allow for MIC e and TIC e to be computed simultaneously very quickly, enabling analysis of 
extremely large data sets using both statistics together. 

Our contributions are of both theoretical and practical importance for several reasons. First, 
our characterization of MIC* as the large-sample limit of MIC sheds light on the latter statistic. 
For example, while MIC is parametrized, MIC* is not. Knowing that MIC converges in probability
to MIC* tells us that this parametrization is statistical only: it controls the bias/variance
properties of the statistic, but not its asymptotic behavior. 

Second, the normalization in the definition of MIC, while empirically seen to yield good
performance, had previously not been theoretically understood. Our result that this normalization
is the minimal smoothing necessary to make mutual information uniformly continuous provides
for the first time a lens through which the normalization is canonical. In doing so, it constitutes
an initial step toward understanding the role of the normalization in the performance of MIC*
and MIC. The uniform continuity of MIC* and the lack of continuity of ordinary mutual information
also suggest that estimation of the former may be easier in some sense than estimation
of the latter. This is consonant with the result concerning difficulty of estimation of mutual
information shown in [16]. It is also borne out empirically by the substantial finite-sample bias 
and variance observed in [3] of the Kraskov mutual information estimator [29] compared to 
MIC e . 

Third, our alternate characterization of MIC* in terms of one-dimensional optimization over 
partitions rather than two-dimensional optimization over grids enhances our understanding of 
how to efficiently compute it in the large-sample limit and estimate it from finite samples using 
MIC e . This is a significant improvement over the previous state of affairs, in which the statistic
MIC could only be approximated heuristically, and at speeds orders of magnitude slower than the
results in this paper now allow.

Finally, the introduction of the total information coefficient provides evidence that the basic
approach of considering the set of normalized mutual information values achievable by applying
different grids to a joint distribution is of fundamental value in characterizing dependence.
Interestingly, a statistic introduced in [31] follows a similar approach by considering the
(non-normalized) sum of the mutual information values achieved by all possible finite grids.
Consistent with our demonstration here that an aggregative grid-based approach works well, that
statistic also achieves excellent power. (TIC e is compared to the statistic from [31] in our
companion paper, [3].) 

Taken together, our results point to joint use of the statistics MIC e and TIC e as a theoretically
grounded, computationally efficient, and highly practical approach to data exploration.
Specifically, since the two statistics can be computed simultaneously with little extra cost beyond
that of computing either individually, we propose computing both of them on all variable
pairs in a data set, using TIC e to filter out non-significant associations, and then using MIC e
to rank the remaining variable pairs. Such a strategy would have the advantage of leveraging 
the state-of-the-art power of TIC e to substantially reduce the multiple-testing burden on MIC e , 
while utilizing the latter statistic’s state-of-the-art equitability to effectively rank relationships 
for follow-up by the practitioner. 
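
The filter-then-rank strategy proposed above can be sketched in a few lines. This is only an illustration under stated assumptions: the function name `filter_then_rank`, the precomputed TIC e p-values, and the use of the Benjamini-Hochberg procedure as the multiple-testing correction are all hypothetical choices made for this example and are not specified in the text.

```python
from typing import List, Sequence, Tuple

def filter_then_rank(
    pairs: Sequence[Tuple[str, str]],
    tic_pvalues: Sequence[float],
    mic_scores: Sequence[float],
    alpha: float = 0.05,
) -> List[Tuple[Tuple[str, str], float]]:
    """Keep pairs whose TIC_e p-value passes Benjamini-Hochberg, then rank by MIC_e.

    The choice of Benjamini-Hochberg is an assumption made for this sketch.
    """
    m = len(pairs)
    # Step-up rule: find the largest rank r with p_(r) <= alpha * r / m.
    order = sorted(range(m), key=lambda i: tic_pvalues[i])
    cutoff = 0
    for rank, i in enumerate(order, start=1):
        if tic_pvalues[i] <= alpha * rank / m:
            cutoff = rank
    kept = order[:cutoff]
    # Rank the surviving pairs by MIC_e, strongest first.
    kept.sort(key=lambda i: mic_scores[i], reverse=True)
    return [(pairs[i], mic_scores[i]) for i in kept]

# Hypothetical scores for four variable pairs.
pairs = [("a", "b"), ("a", "c"), ("b", "c"), ("c", "d")]
pvals = [0.001, 0.60, 0.012, 0.03]
mics = [0.40, 0.90, 0.75, 0.55]
ranked = filter_then_rank(pairs, pvals, mics, alpha=0.05)
```

In this toy input the pair with the highest MIC e score is discarded because its TIC e p-value is not significant, which is the behavior the two-stage strategy is meant to provide.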

Of course our results, while useful, nevertheless have limitations that warrant exploration
in future work. First, for a sample $D$ from the distribution of some random variable $(X, Y)$, all of the
sample quantities we define here use the naive estimate $I(D|_G)$ of the quantity $I((X,Y)|_G)$ for
various grids $G$. There is a long and fruitful line of work on more sophisticated estimators of
the discrete mutual information [9] whose use instead of $I(D|_G)$ could improve the statistics
introduced here. Second, our approach to approximating the MIC* of a given joint density
consists of computing a finite subset of an infinite set whose supremum we seek to calculate. 
However, the choice of how large a finite set we should compute in order to approximate the 
supremum to a given precision remains heuristic. Finally, though empirical characterization of 
the equitability of MIC e on representative sets of relationships is important and promising, we 
are still missing a theoretical characterization of its equitability in the large-sample limit. A clear 
theoretical demarcation of the set of relationships on which MIC* achieves good equitability with
respect to R^2, and an understanding of why that is, would greatly advance our understanding
of both MIC* and equitability. 
A Proof of Theorem 1

This appendix is devoted to proving Theorem 1, restated below.

Theorem Let $f : m^\infty \to \mathbb{R}$ be uniformly continuous, and assume that $f \circ r_i \to f$ pointwise.
Then for every random variable $(X,Y)$, we have

\[ (f \circ r_{B(n)})\bigl(\widehat{M}(D_n)\bigr) \to f(M(X,Y)) \]

in probability, where $D_n$ is a sample of size $n$ from the distribution of $(X,Y)$, provided
$\omega(1) < B(n) \le O(n^{1-\varepsilon})$ for some $\varepsilon > 0$.

We prove the theorem by a sequence of lemmas that build on each other to bound the bias of
$I^*(D, k, \ell)$. The general strategy is to capture the dependencies between different $k$-by-$\ell$ grids
$G$ by considering a “master grid” $\Gamma$ that contains many more than $k\ell$ cells. Given this master
grid, we first bound the difference between $I(D|_G)$ and $I((X,Y)|_G)$ only for sub-grids $G$ of $\Gamma$.
The bound is in terms of the difference between $D|_\Gamma$ and $(X,Y)|_\Gamma$. We then show that this
bound can be extended without too much loss to all $k$-by-$\ell$ grids. This gives what we seek,
because then the difference between $I(D|_G)$ and $I((X,Y)|_G)$ is uniformly bounded for all grids
$G$ in terms of the same random variable: $D|_\Gamma$. Once this is done, standard arguments give the
consistency we seek.

In our argument we occasionally require technical facts about entropy and mutual information
that are self-contained and unrelated to the central ideas. These lemmas are consolidated
in Appendix L.

We begin by using one of these technical lemmas to prove a bound on the difference between
$I(D|_G)$ and $I((X,Y)|_G)$ that is uniform over all grids $G$ that are sub-grids of a much denser
grid $\Gamma$. The common structure imposed by $\Gamma$ will allow us to capture the dependence between
the quantities $|I(D|_G) - I((X,Y)|_G)|$ for different grids $G$.

Lemma A.1. Let $\Pi = (\Pi_X, \Pi_Y)$ and $\Psi = (\Psi_X, \Psi_Y)$ be random variables distributed over the
cells of a grid $\Gamma$, and let $(\pi_{i,j})$ and $(\psi_{i,j})$ be their respective distributions. Define

\[ \varepsilon_{i,j} = \frac{\psi_{i,j} - \pi_{i,j}}{\pi_{i,j}}. \]

Let $G$ be a sub-grid of $\Gamma$ with $B$ cells. Then for every fixed $0 < a < 1$ we have

\[ |I(\Psi|_G) - I(\Pi|_G)| \le O\Bigl( (\log B) \sum_{i,j} |\varepsilon_{i,j}| \Bigr) \]

when $|\varepsilon_{i,j}| \le 1 - a$ for all $i$ and $j$.

Proof. Let $P = \Pi|_G$ and $Q = \Psi|_G$ be the random variables induced by $\Pi$ and $\Psi$ respectively on
the cells of $G$. Using the fact that $I(X,Y) = H(X) + H(Y) - H(X,Y)$, we write

\[ |I(Q) - I(P)| \le |H(Q_X) - H(P_X)| + |H(Q_Y) - H(P_Y)| + |H(Q) - H(P)| \]

where $Q_X$ and $P_X$ denote the marginal distributions on the columns of $G$ and $Q_Y$ and $P_Y$ denote
the marginal distributions on the rows. We can bound each of the terms on the right-hand side
of the equation above using a Taylor expansion argument given in Lemma L.1, whose proof is
found in the appendix. Doing so gives

\[ |I(Q) - I(P)| \le (\ln B) \Bigl( \sum_i O(|\varepsilon_{i,*}|) + \sum_j O(|\varepsilon_{*,j}|) + \sum_{i,j} O(|\varepsilon_{i,j}|) \Bigr) \]

where

\[ \varepsilon_{i,*} = \frac{\sum_j (\psi_{i,j} - \pi_{i,j})}{\sum_j \pi_{i,j}} \]

and $\varepsilon_{*,j}$ is defined analogously.

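To obtain the result, we observe that

\[ |\varepsilon_{i,*}| = \frac{\bigl|\sum_j \pi_{i,j}\,\varepsilon_{i,j}\bigr|}{\sum_j \pi_{i,j}} \le \sum_j \frac{\pi_{i,j}}{\sum_j \pi_{i,j}}\,|\varepsilon_{i,j}| \le \sum_j |\varepsilon_{i,j}| \]

since $\pi_{i,j} / \sum_j \pi_{i,j} \le 1$, and the analogous bound holds for $|\varepsilon_{*,j}|$. □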

We now extend Lemma A.1 to all grids with $B$ cells rather than just those that are sub-grids
of the master grid $\Gamma$. The proof of this lemma relies on an information-theoretic result proven
in Appendix B that bounds the difference in mutual information between two distributions that
can be obtained from each other by moving a small amount of probability mass.

Lemma A.2. Let $\Pi = (\Pi_X, \Pi_Y)$ and $\Psi = (\Psi_X, \Psi_Y)$ be random variables, and let $\Gamma$ be a grid.
Define $\varepsilon_{i,j}$ on $\Pi|_\Gamma$ and $\Psi|_\Gamma$ as in Lemma A.1. Let $G$ be any grid with $B$ cells, and let $\delta$ (resp. $d$)
represent the total probability mass of $\Pi|_\Gamma$ (resp. $\Psi|_\Gamma$) falling in cells of $\Gamma$ that are not contained
in individual cells of $G$. We have that

\[ |I(\Psi|_G) - I(\Pi|_G)| \le O\Bigl( \Bigl( \sum_{i,j} |\varepsilon_{i,j}| + \delta + d \Bigr) \log B + \delta \log(1/\delta) + d \log(1/d) \Bigr) \]

provided that the $|\varepsilon_{i,j}|$ are bounded away from 1 and that $d, \delta \le 1/2$.

Proof. In the proof below, we use the convention that for any two grids $G$ and $G'$ and any
random variable $Z$, the expression $\Delta_Z(G, G')$ denotes $|I(Z|_G) - I(Z|_{G'})|$.

Consider the grid $G'$ obtained by replacing every horizontal or vertical line in $G$ that is not
in $\Gamma$ with a closest line in $\Gamma$. The grid $G'$ is clearly a sub-grid of $\Gamma$. Moreover, $\Pi|_{G'}$ (resp. $\Psi|_{G'}$)
can be obtained from $\Pi|_G$ (resp. $\Psi|_G$) by moving at most $\delta$ (resp. $d$) probability mass. This
can be shown to imply that

\[ \Delta_\Pi(G, G') \le O(\delta \log(1/\delta) + \delta \log B) \quad \text{and} \quad \Delta_\Psi(G', G) \le O(d \log(1/d) + d \log B). \]

The proof of this information-theoretic fact is self-contained and so we defer it to Proposition 7
in Appendix B, as it is more central to the arguments presented there.

With $\Delta_\Pi(G, G')$ and $\Delta_\Psi(G', G)$ bounded in terms of $\delta$ and $d$, we can bound $|I(\Psi|_G) - I(\Pi|_G)|$
using the triangle inequality by comparing it with

\[ \Delta_\Pi(G, G') + |I(\Pi|_{G'}) - I(\Psi|_{G'})| + \Delta_\Psi(G', G) \]

and bounding the middle term using Lemma A.1, since $G' \subset \Gamma$. □

We now use the fact that the variables $\varepsilon_{i,j}$ defined in Lemma A.1 are small with high
probability to give a concrete bound on the bias of $I(D|_G)$ that is uniform over all $k$-by-$\ell$
grids $G$ and that holds with high probability. It is useful at this point to recall that, given a
distribution $(X, Y)$, an equipartition of $(X,Y)$ is a grid $G$ such that all the rows of $(X,Y)|_G$
have the same probability mass, and all the columns do as well.
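
To make the two objects in this discussion concrete, the sketch below builds an empirical equipartition of a sample from its quantiles and evaluates the naive (plug-in) estimate of the mutual information on the resulting grid. It is a minimal illustration, assuming a sample with no tied values; the helper names `equipartition_edges` and `grid_mutual_information` are hypothetical and do not come from the paper.

```python
import numpy as np

def equipartition_edges(values: np.ndarray, bins: int) -> np.ndarray:
    """Interior cut points splitting the sample into `bins` groups of near-equal size."""
    qs = np.linspace(0.0, 1.0, bins + 1)[1:-1]
    return np.quantile(values, qs)

def grid_mutual_information(x, y, x_edges, y_edges) -> float:
    """Naive (plug-in) mutual information, in nats, of the sample binned by the grid."""
    xi = np.searchsorted(x_edges, x, side="right")   # column index of each point
    yi = np.searchsorted(y_edges, y, side="right")   # row index of each point
    joint = np.zeros((len(y_edges) + 1, len(x_edges) + 1))
    np.add.at(joint, (yi, xi), 1.0)                  # count points per cell
    joint /= joint.sum()
    px = joint.sum(axis=0, keepdims=True)            # column marginal
    py = joint.sum(axis=1, keepdims=True)            # row marginal
    nz = joint > 0
    return float(np.sum(joint[nz] * np.log(joint[nz] / (py @ px)[nz])))

rng = np.random.default_rng(0)
x = rng.uniform(size=2000)
y_dep = x + 0.05 * rng.normal(size=2000)   # noisy linear relationship
y_ind = rng.uniform(size=2000)             # independent of x
ex = equipartition_edges(x, 4)
mi_dep = grid_mutual_information(x, y_dep, ex, equipartition_edges(y_dep, 4))
mi_ind = grid_mutual_information(x, y_ind, ex, equipartition_edges(y_ind, 4))
```

On a 4-by-4 grid the estimate cannot exceed log 4, and for the independent sample it is small but positive, which reflects the bias that the lemmas in this appendix control.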
Lemma A.3. Let $D_n$ be a sample of size $n$ from the distribution of a pair $(X, Y)$ of jointly
distributed random variables. For any $\alpha > 0$, any $\varepsilon > 0$, and any integers $k, \ell > 1$, we have that
for all $n$

\[ |I(D_n|_G) - I((X,Y)|_G)| \le O\Bigl( \frac{\log(k\ell)}{C(n)^{\alpha}} + \frac{\log(k\ell n)}{n^{\varepsilon/4}} \Bigr) \]

for every $k$-by-$\ell$ grid $G$ with probability at least $1 - C(n)\,e^{-\Omega(n / C(n)^{1+2\alpha})}$, where $C(n) = k\ell n^{\varepsilon/2}$.


Proof. Fix $n$, and let $\Gamma$ be an equipartition of $(X, Y)$ into $kn^{\varepsilon/4}$ rows and $\ell n^{\varepsilon/4}$ columns. $C(n)$
is now the number of cells in $\Gamma$. Lemma A.2, with $\Pi = (X,Y)$, $\Psi = D$, and $B = k\ell$, shows that
$|I(D|_G) - I((X,Y)|_G)|$ is at most

\[ O\Bigl( \Bigl( \sum_{i,j} |\varepsilon_{i,j}| + \delta + d \Bigr) \log(k\ell) + \delta \log(1/\delta) + d \log(1/d) \Bigr) \]

provided the $\varepsilon_{i,j}$ have absolute value bounded away from 1, and provided that $d, \delta \le 1/2$.

The remainder of the proof proceeds as follows. We first show that the $\varepsilon_{i,j}$ are small with
high probability. This will both show that the lemma’s requirement on the $\varepsilon_{i,j}$ holds and allow
us to bound the sum in the inequality above. We will then use our bound on the $\varepsilon_{i,j}$ to bound
$d$ in terms of $\delta$. Finally, we will bound $\delta$ using the fact that the number of rows and columns
in $\Gamma$ increases with $n$. This will give us that $d, \delta \le 1/2$ and allow us to bound the rest of the
terms in the expression above.

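Bounding the $\varepsilon_{i,j}$: We bound the $\varepsilon_{i,j}$ using a multiplicative Chernoff bound. Let $\pi_{i,j}$ and
$\psi_{i,j}$ represent the probability mass functions of $(X,Y)|_\Gamma$ and $D|_\Gamma$ respectively. We write

\[ P(|\varepsilon_{i,j}| \ge \delta) = 1 - P\bigl( \pi_{i,j}(1 - \delta) < \psi_{i,j} < \pi_{i,j}(1 + \delta) \bigr) \le e^{-\Omega(n \pi_{i,j} \delta^2)} \]

since $n\psi_{i,j}$ is a sum of $n$ i.i.d. Bernoulli random variables and $E(n\psi_{i,j}) = n\pi_{i,j}$. (See, e.g., [33].)
Setting $\delta = \sqrt{\pi_{i,j}} / C(n)^{1/2+\alpha}$ yields

\[ P\Bigl( |\varepsilon_{i,j}| \ge \frac{\sqrt{\pi_{i,j}}}{C(n)^{1/2+\alpha}} \Bigr) \le e^{-\Omega(n / C(n)^{1+2\alpha})}. \]

A union bound over the pairs $(i, j)$ then gives that, with the desired probability, the above
bound on $|\varepsilon_{i,j}|$ holds for all $i, j$.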
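
As a sanity check on the kind of concentration used here, the sketch below compares the exact two-sided tail of a binomial count with a standard textbook form of the multiplicative Chernoff bound, $P(|X - \mu| \ge \delta\mu) \le 2e^{-\delta^2\mu/3}$ for $0 < \delta < 1$. The particular constant 1/3 and the values of `n` and `p` are assumptions chosen for illustration; the proof only needs the $e^{-\Omega(\cdot)}$ form.

```python
import math

def binomial_two_sided_tail(n: int, p: float, delta: float) -> float:
    """Exact P(|X - n*p| >= delta*n*p) for X ~ Binomial(n, p)."""
    mu = n * p
    return sum(
        math.comb(n, c) * p**c * (1 - p) ** (n - c)
        for c in range(n + 1)
        if abs(c - mu) >= delta * mu
    )

def chernoff_bound(n: int, p: float, delta: float) -> float:
    """Two-sided multiplicative Chernoff bound, valid for 0 < delta < 1."""
    return 2.0 * math.exp(-(delta**2) * n * p / 3.0)

# X plays the role of n * psi_{i,j}: the number of sample points landing in one cell
# whose true probability mass is p (illustrative values).
n, p = 400, 0.05
results = []
for delta in (0.25, 0.5, 0.75):
    exact = binomial_two_sided_tail(n, p, delta)
    bound = chernoff_bound(n, p, delta)
    results.append((delta, exact, bound))
    print(f"delta={delta}: exact tail={exact:.4g}, Chernoff bound={bound:.4g}")
```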

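Bounding $\sum |\varepsilon_{i,j}|$: The bound on the $\varepsilon_{i,j}$ implies that

\[
\begin{aligned}
\sum_{i,j} |\varepsilon_{i,j}| &\le \frac{1}{C(n)^{1/2+\alpha}} \sum_{i,j} \sqrt{\pi_{i,j}} \\
&\le \frac{1}{C(n)^{1/2+\alpha}} \sum_{i,j} \sqrt{\frac{1}{C(n)}} \\
&= \frac{1}{C(n)^{\alpha}}
\end{aligned}
\]

where the second line follows from the fact that the function $\sum_{i,j} \sqrt{\pi_{i,j}}$ is symmetric and concave
and therefore, when restricted to the hyperplane $\sum_{i,j} \pi_{i,j} = 1$, must achieve its maximum when
$\pi_{i,j} = 1/C(n)$ for all $i, j$.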
Bounding $d$ in terms of $\delta$: We use our bound on the $\varepsilon_{i,j}$ to bound $d$. We do so by observing
that it implies

\[ \psi_{i,j} \le \pi_{i,j} \Bigl( 1 + \frac{\sqrt{\pi_{i,j}}}{C(n)^{1/2+\alpha}} \Bigr) = \pi_{i,j} + \frac{\pi_{i,j}^{3/2}}{C(n)^{1/2+\alpha}} \le \pi_{i,j} + \frac{\pi_{i,j}}{C(n)^{1/2+\alpha}} \le 2\pi_{i,j} \]

since $\pi_{i,j} \le 1$ and $C(n) \ge 1$.

The connection to $d$ comes from the fact that for any column $j$ of $\Gamma$, this means that

\[ \psi_{*,j} = \sum_i \psi_{i,j} \le 2 \sum_i \pi_{i,j} = 2\pi_{*,j}. \]

This also applies to the sums across rows. Since $d$ is a sum of terms of the form $\psi_{*,j}$ and $\psi_{i,*}$
for $j$ in some index set $J$ and $i$ in an index set $I$, and $\delta$ is a sum of terms of the form $\pi_{*,j}$ and
$\pi_{i,*}$ with the same index sets, we therefore get that $d \le 2\delta$.

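Bounding $\delta$ and obtaining the result: To bound $\delta$, we observe that because $G$ has at most
$\ell - 1$ vertical lines and $k - 1$ horizontal lines, we have

\[ \delta \le \frac{\ell}{\ell n^{\varepsilon/4}} + \frac{k}{k n^{\varepsilon/4}} = \frac{2}{n^{\varepsilon/4}}. \]

This bound on $\delta$ allows us to bound the terms involving $d$ and $\delta$ by

\[ \delta + d \le O\Bigl( \frac{1}{n^{\varepsilon/4}} \Bigr), \qquad \delta \log(1/\delta) + d \log(1/d) \le O\Bigl( \frac{\log n}{n^{\varepsilon/4}} \Bigr). \]

Combining all of the bounds gives the desired result. □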


Our final lemma shows that as long as $B(n)$ doesn’t grow too fast, the bound from the
previous lemma yields a uniform bound on the entire sample characteristic matrix. This is done
by specifying an error threshold for which Lemma A.3 yields a bound that holds with high
probability, and then invoking a union bound.


Lemma A.4. Let $D_n$ be a sample of size $n$ from the distribution of a pair $(X, Y)$ of jointly
distributed random variables. For every $B(n) = O(n^{1-\varepsilon})$ there exists an $a > 0$ such that for
sufficiently large $n$,

\[ \bigl| \widehat{M}(D_n)_{k,\ell} - M(X,Y)_{k,\ell} \bigr| \le O\Bigl( \frac{1}{n^{a}} \Bigr) \]

holds for all $k\ell \le B(n)$ with probability $P(n) = 1 - o(1)$, where $\widehat{M}(D_n)_{k,\ell}$ is the $k,\ell$-th
entry of the sample characteristic matrix and $M(X,Y)_{k,\ell}$ is the $k,\ell$-th entry of the population
characteristic matrix of $(X, Y)$.


Proof. Fix $k, \ell$, and any $\alpha$ satisfying $0 < \alpha < \varepsilon/(4 - 2\varepsilon)$. Lemma A.3 implies that with high
probability the difference $|\widehat{M}(D_n)_{k,\ell} - M_{k,\ell}|$ is at most

\[
\begin{aligned}
O\Bigl( \frac{\log(k\ell)}{C(n)^{\alpha}} + \frac{\log(k\ell n)}{n^{\varepsilon/4}} \Bigr)
&\le O\Bigl( \frac{\log n}{C(n)^{\alpha}} + \frac{\log n}{n^{\varepsilon/4}} \Bigr) \\
&\le O\Bigl( \frac{\log n}{n^{\alpha\varepsilon/2}} + \frac{\log n}{n^{\varepsilon/4}} \Bigr)
\end{aligned}
\]

where the first inequality comes from $k\ell \le B(n)$ and the second is because $C(n) = k\ell n^{\varepsilon/2} \ge n^{\varepsilon/2}$.
This bound is at most $O(1/n^{a})$ for every $a < \min\{\alpha\varepsilon/2, \varepsilon/4\}$, as desired. It remains only to
show that the bound holds with high probability across all $k\ell \le B(n)$.

Lemma A.3 states that the probability our bound holds for one fixed pair $(k, \ell)$ is at least

\[ 1 - C(n)\,e^{-\Omega(n / C(n)^{1+2\alpha})} \ge 1 - O(n)\,e^{-\Omega(n^{u})} \]

for some positive $u$. This is because $C(n) \le B(n) n^{\varepsilon/2} \le O(n^{1-\varepsilon/2})$ for large $n$, and so our
choice of $\alpha$ ensures that $C(n)^{1+2\alpha} = O(n^{1-u})$ for some $u > 0$.

We can then perform a union bound over all pairs $k\ell \le B(n)$: since the number of such
pairs can be bounded by a polynomial in $n$, we have that the desired condition is satisfied for
all $k\ell \le B(n)$ with probability approaching 1. □
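
The exponent bookkeeping behind the constraint on $\alpha$ in this proof can be spelled out. The derivation below is a worked restatement of that step, not an additional result; the numerical values at the end are purely illustrative.

```latex
% Since C(n) \le O(n^{1-\varepsilon/2}), we have
\[
  C(n)^{1+2\alpha} \le O\bigl(n^{(1-\varepsilon/2)(1+2\alpha)}\bigr),
\]
% and the exponent is strictly below 1 exactly when
\[
  (1-\varepsilon/2)(1+2\alpha) < 1
  \iff 1 + 2\alpha < \frac{2}{2-\varepsilon}
  \iff \alpha < \frac{\varepsilon}{4-2\varepsilon}.
\]
% One may then take u = 1 - (1-\varepsilon/2)(1+2\alpha) > 0.
% Illustration: \varepsilon = 1/2 requires \alpha < 1/6; with \alpha = 1/10,
% (3/4)(6/5) = 9/10, so u = 1/10.
```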

We are now ready to prove the main result. 

Theorem Let $f : m^\infty \to \mathbb{R}$ be uniformly continuous, and assume that $f \circ r_i \to f$ pointwise.
Then for every random variable $(X,Y)$, we have

\[ (f \circ r_{B(n)})\bigl(\widehat{M}(D_n)\bigr) \to f(M(X,Y)) \]

in probability, where $D_n$ is a sample of size $n$ from the distribution of $(X,Y)$, provided
$\omega(1) < B(n) \le O(n^{1-\varepsilon})$ for some $\varepsilon > 0$.


Proof. Let $N$ denote $B(n)$, let $M_N = r_N(M)$, and let $\widehat{M}_N(D_n) = r_N(\widehat{M}(D_n))$. We begin by
writing

\[
\begin{aligned}
\bigl| f(\widehat{M}_N(D_n)) - f(M) \bigr|
&\le \bigl| f(\widehat{M}_N(D_n)) - f(M_N) \bigr| + |f(M_N) - f(M)| \\
&= \bigl| f(\widehat{M}_N(D_n)) - f(M_N) \bigr| + |(f \circ r_N)(M) - f(M)|
\end{aligned}
\]

and observing that as $n \to \infty$, the second term vanishes by the pointwise convergence of $f \circ r_i$
and the fact that $B(n) > \omega(1)$. It therefore suffices to show that the first term converges to 0
in probability. Since $f$ is uniformly continuous, we can establish this via a simple adaptation of
the continuous mapping theorem, which says that if the sequence of random variables $R_n \to R$
in probability, and $g$ is continuous, then $g(R_n) \to g(R)$ in probability. We replace $R$ with a
second sequence, and replace continuity with uniform continuity.

Let $\|\cdot\|$ denote the supremum norm on $m^\infty$, and fix any $z > 0$. Then, for any $\delta > 0$, define

\[ C_\delta = \{ A \in m^\infty : \exists A' \in m^\infty \text{ s.t. } \|A - A'\| < \delta,\ |f(A) - f(A')| > z \}. \]

This is the set of matrices $A \in m^\infty$ for which it is possible to find, within a $\delta$-neighborhood
of $A$, a second matrix that $f$ maps to more than $z$ away from $f(A)$. Because $f$ is uniformly
continuous, there exists a $\delta^*$ sufficiently small so that $C_{\delta^*} = \emptyset$.

Suppose that $|f(\widehat{M}_N(D_n)) - f(M_N)| > z$. This means that either $\|\widehat{M}_N(D_n) - M_N\| \ge \delta^*$,
or $M_N \in C_{\delta^*}$. The latter option is impossible since $C_{\delta^*} = \emptyset$, and Lemma A.4 tells us that
$P(\|\widehat{M}_N(D_n) - M_N\| \ge \delta^*) \to 0$ as $n$ grows. We therefore have that

\[ \bigl| f(\widehat{M}_N(D_n)) - f(M_N) \bigr| \to 0 \]

in probability, as desired. □
B Proof of Theorem 2

In this appendix we prove Theorem 2, reproduced below.

Theorem Let $\mathcal{P}(\mathbb{R}^2)$ denote the space of random variables supported on $\mathbb{R}^2$ equipped with the
metric of statistical distance. The map from $\mathcal{P}(\mathbb{R}^2)$ to $m^\infty$ defined by $(X,Y) \mapsto M(X,Y)$ is
uniformly continuous.


The proposition below begins our argument with the simple observation that the family of
maps consisting of applying any finite grid to some $(X, Y) \in \mathcal{P}(\mathbb{R}^2)$ is uniformly equicontinuous.
The reason this holds is that $(X,Y)|_G$ is a deterministic function of $(X,Y)$, and deterministic
functions cannot increase statistical distance.

Proposition 6. Let $\mathcal{G}$ be the set of all finite grids. The family $\{(X, Y) \mapsto (X, Y)|_G : G \in \mathcal{G}\}$
is uniformly equicontinuous on $\mathcal{P}(\mathbb{R}^2)$.

Proof. To establish uniform equicontinuity, we need to show that, given some $(X, Y) \in \mathcal{P}(\mathbb{R}^2)$
and some $\varepsilon > 0$, we can choose $\delta$ to satisfy the continuity condition in a way that does not
depend on $G$ or on $(X, Y)$. But because deterministic functions cannot increase statistical
distance, we have that if $(X, Y), (X', Y') \in \mathcal{P}$ are at most $\varepsilon$ apart then

\[ \Delta\bigl( (X,Y)|_G, (X',Y')|_G \bigr) \le \Delta\bigl( (X,Y), (X',Y') \bigr) \le \varepsilon \]

where $\Delta$ denotes statistical distance. Choosing $\delta = \varepsilon$ therefore gives the result. □
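
The key fact in this proof, that applying a grid cannot increase statistical distance, is easy to check numerically. The sketch below is illustrative only: it assumes that a fine 6-by-6 discrete distribution stands in for the continuous one, that statistical distance is total variation with the factor 1/2, and that applying a grid amounts to merging rows and columns. The helper names are hypothetical.

```python
import numpy as np

def total_variation(p: np.ndarray, q: np.ndarray) -> float:
    """Statistical (total variation) distance between two discrete distributions."""
    return 0.5 * float(np.abs(p - q).sum())

def coarsen(joint: np.ndarray, row_groups, col_groups) -> np.ndarray:
    """Merge rows and columns of a joint distribution according to the given groups."""
    merged_rows = np.stack([joint[list(g), :].sum(axis=0) for g in row_groups])
    return np.stack([merged_rows[:, list(g)].sum(axis=1) for g in col_groups], axis=1)

rng = np.random.default_rng(1)
p = rng.dirichlet(np.ones(36)).reshape(6, 6)   # stand-in for (X, Y)
q = rng.dirichlet(np.ones(36)).reshape(6, 6)   # stand-in for (X', Y')
rows = [(0, 1), (2,), (3, 4, 5)]               # a 3-by-3 grid G on the fine cells
cols = [(0,), (1, 2, 3), (4, 5)]
fine = total_variation(p, q)
coarse = total_variation(coarsen(p, rows, cols), coarsen(q, rows, cols))
print(f"distance before gridding: {fine:.4f}, after gridding: {coarse:.4f}")
```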


At this point it is tempting to try to use continuity properties of discrete mutual information
to obtain uniform continuity of the characteristic matrix. And indeed, this strategy does
yield that each individual entry of the characteristic matrix is a uniformly continuous function.
However, to obtain continuity of the entire (infinite) characteristic matrix we need to
make a statement about all grid resolutions simultaneously. This is not straightforward because
mutual information is only uniformly continuous for a fixed grid resolution, and the family
$\{(X, Y) \mapsto I((X,Y)|_G) : G \in \mathcal{G}\}$ is in fact not even equicontinuous.

The normalization in the definition of MIC* is what allows us to establish the uniform
continuity of the characteristic matrix despite this problem. To see why, suppose we have a
distribution over a $k$-by-$\ell$ grid and we are allowed to move at most $\delta$ away in statistical distance
for some small $\delta$. The largest change in discrete mutual information that this can cause indeed
increases as we increase $k$ and $\ell$. However, it turns out that we can bound the extent of this
“non-uniformity”: the proposition below shows that as we move away from a distribution, the discrete
mutual information can change only proportionally to the amount of mass we move, with the
proportionality constant bounded by $\log \min\{k, \ell\}$. Because $\log \min\{k, \ell\}$ is the quantity by
which we regularize the entries of the characteristic matrix, this is exactly enough to make the
normalized matrix continuous. This proposition is the technical heart of our continuity result.
And as we show in Corollary 3.3 when we demonstrate the non-continuity of the non-normalized
characteristic matrix mutual information, our bound is tight.


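Proposition 7. Let $I_{k,\ell} : \mathcal{P}(\{1,\dots,k\} \times \{1,\dots,\ell\}) \to \mathbb{R}$ denote the discrete mutual information
function on $k$-by-$\ell$ grids. For $0 < \delta < 1/4$, the maximal change in $I_{k,\ell}$ over any subset of
$\mathcal{P}(\{1,\dots,k\} \times \{1,\dots,\ell\})$ of diameter $\delta$ (in statistical distance) is

\[ O\Bigl( \delta \log\Bigl( \frac{1}{\delta} \Bigr) + \delta \log \min\{k, \ell\} \Bigr). \]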
Proof. Without loss of generality, assume $k \le \ell$, so that $\log \min\{k, \ell\} = \log k$. Let $(X,Y)$ and
$(X', Y')$ be two random variables distributed over $\{1,\dots,k\} \times \{1,\dots,\ell\}$ that are at most $\delta$
apart in statistical distance. Using $I(X,Y) = H(Y) - H(Y|X)$, we can express the difference
between the mutual information of these two pairs of random variables as

\[ |I(X,Y) - I(X',Y')| \le |H(Y) - H(Y')| + |H(Y|X) - H(Y'|X')|. \]

We now use Lemma L.5, which relates movement of probability mass to changes in entropy
and is proven in Appendix L, to separately bound each of the terms on the right-hand side.
Straightforward application of the lemma to $|H(Y) - H(Y')|$ shows that it is at most
$2H_b(2\delta) + 3\delta \log k$, where $H_b(\cdot)$ is the binary entropy function. Since $H_b(x) \le O(x \log(1/x))$ for $x$ small,
this is $O(\delta \log(1/\delta) + \delta \log k)$.

Bounding the term with the conditional entropies is more involved. Let $p_x = P(X = x)$,
and let $p'_x = P(X' = x)$. We have

\[
\begin{aligned}
|H(Y|X) - H(Y'|X')|
&\le \sum_x \bigl| p_x H(Y|X = x) - p'_x H(Y'|X' = x) \bigr| \\
&\le \sum_x \bigl( p_x |H(Y|X = x) - H(Y'|X' = x)| + |p'_x - p_x|\, H(Y'|X' = x) \bigr) \\
&\le \sum_x p_x |H(Y|X = x) - H(Y'|X' = x)| + \sum_x |p'_x - p_x| \log k \\
&\le \sum_x p_x |H(Y|X = x) - H(Y'|X' = x)| + \delta \log k \qquad (1)
\end{aligned}
\]

where the last line is because $|p_x - p'_x| \le \delta$ and $H(Y'|X' = x) \le \log k$.

Now let $\delta_{x+}$ be the magnitude of all the probability mass entering any cell in column $x$,
let $\delta_{x-}$ be the magnitude of all the probability mass leaving any cell in column $x$, and let
$\delta_x = \delta_{x+} + \delta_{x-}$. Using this notation, we can again apply Lemma L.5 to obtain

\[
\begin{aligned}
\sum_x p_x |H(Y|X = x) - H(Y'|X' = x)|
&\le \sum_x p_x \Bigl( 2H_b\Bigl( \frac{2\delta_x}{p_x} \Bigr) + 3\,\frac{\delta_x}{p_x} \log k \Bigr) \\
&\le 2 \sum_x p_x H_b\Bigl( \frac{2\delta_x}{p_x} \Bigr) + 3 \sum_x \delta_x \log k \\
&\le 2 \sum_x p_x H_b\Bigl( \frac{2\delta_x}{p_x} \Bigr) + 3\delta \log k \\
&\le 2H_b(2\delta) + 3\delta \log k
\end{aligned}
\]

where the last line is by application of Lemma L.2 from the appendix, which bounds weighted
sums of binary entropies.

Combining this with Line (1) gives that

\[ |H(Y|X) - H(Y'|X')| \le 2H_b(2\delta) + 4\delta \log k \]

which, together with the bound on $|H(Y) - H(Y')|$ and the fact that $H_b(x) \le O(x \log(1/x))$
for $x$ small, gives the result. □
Having bounded the extent to which variation in mutual information depends on grid resolution,
we are now ready to show the uniform continuity of the characteristic matrix.

Theorem Let $\mathcal{P}(\mathbb{R}^2)$ denote the space of random variables supported on $\mathbb{R}^2$ equipped with the
metric of statistical distance. The map from $\mathcal{P}(\mathbb{R}^2)$ to $m^\infty$ defined by $(X,Y) \mapsto M(X,Y)$ is
uniformly continuous.


Proof. We complete the proof in three steps. First, we show that a certain family of functions
$\mathcal{F}$ is uniformly equicontinuous. Second, we use this to show that a different family $\mathcal{F}'$ consisting
of functions of the form $\sup_{g \in A} g$ with $A \subset \mathcal{F}$ is uniformly equicontinuous. Finally, we argue
that since the entries of $M(X,Y)$ consist of the functions in $\mathcal{F}'$, this is sufficient to establish
the result.

Define

\[ \mathcal{F} = \Bigl\{ (X,Y) \mapsto \frac{I_{k,\ell}((X,Y)|_G)}{\log \min\{k, \ell\}} : k, \ell \in \mathbb{Z}_{>1},\ G \in G(k, \ell) \Bigr\}. \]

$\mathcal{F}$ is uniformly equicontinuous by the following argument. Given some $\varepsilon > 0$, we know
(Proposition 6) that for any $(X',Y')$ in an $\varepsilon$-ball around $(X,Y)$, $(X',Y')|_G$ will remain within $\varepsilon$ of
$(X,Y)|_G$ for any $G$. Proposition 7 then tells us that if $\varepsilon$ is sufficiently small then the distance between
$I_{k,\ell}((X',Y')|_G)$ and $I_{k,\ell}((X,Y)|_G)$ will be at most

\[ O(\varepsilon \log(1/\varepsilon) + \varepsilon \log \min\{k, \ell\}). \]

After the normalization, this becomes at most $O(\varepsilon(\log(1/\varepsilon) + 1))$, which goes to 0 (uniformly
with respect to $(X, Y)$) as $\varepsilon$ approaches 0, as desired.

Next, define

\[ \mathcal{F}' = \{ (X,Y) \mapsto M(X,Y)_{k,\ell} : k, \ell \in \mathbb{Z}_{>1} \}. \]

Each map in $\mathcal{F}'$ is of the form $\sup_{g \in A} g$ for some $A \subset \mathcal{F}$. Therefore, for a given $\varepsilon > 0$, whatever
$\delta$ establishes the uniform equicontinuity for $\mathcal{F}$ can be used to establish continuity of all the
functions in $\mathcal{F}'$. (To see this: $\sup_{g \in A} g$ can’t increase by more than $\varepsilon$ if no $g$ increases by more
than $\varepsilon$, and $\sup_{g \in A} g$ is also lower bounded by any of the $g$’s, so it can’t decrease by more than
$\varepsilon$ either.) Since we can use the same $\delta$ for all of the maps in $\mathcal{F}'$, they therefore form a uniformly
equicontinuous family.

Finally, the $\delta$ provided by the uniform equicontinuity of $\mathcal{F}'$ also ensures that $M(X',Y')$ is
within $\varepsilon$ of $M(X, Y)$ in the supremum norm, thus giving the uniform continuity of
$(X, Y) \mapsto M(X,Y)$. □


C Proof of Proposition 2

Proposition For some function $N(k, \ell)$, let $M^N$ be the characteristic matrix with normalization
$N$, i.e.,

\[ M^N(X, Y)_{k,\ell} = \frac{I^*((X,Y), k, \ell)}{N(k, \ell)}. \]

If $N(k, \ell) = o(\log \min\{k, \ell\})$ along some infinite path in $\mathbb{N} \times \mathbb{N}$, then $M^N$ and $\sup M^N$ are not
continuous as functions of $\mathcal{P}([0,1] \times [0,1]) \subset \mathcal{P}(\mathbb{R}^2)$.
Proof. Consider a random variable $Z$ uniformly distributed on $[0, 1/2]^2$. Because $Z$ exhibits
statistical independence, $I^*(Z, k, \ell)$ is zero for all $k, \ell$. Now define $Z_\varepsilon$ to be uniformly distributed
on $[0, 1/2]^2$ with probability $1 - \varepsilon$ and uniformly distributed on the line from $(1/2, 1/2)$ to $(1, 1)$
with probability $\varepsilon$.

We lower-bound $I^*(Z_\varepsilon, k, \ell)$. Without loss of generality suppose that $k \le \ell$, and consider a
grid that places all of $[0, 1/2]^2$ into one cell and uniformly partitions the set $[1/2, 1]^2$ into $k - 1$
rows and $k - 1$ columns. By considering just the rows/columns in the set $[1/2, 1]^2$ we see that
this grid gives a mutual information of at least $\varepsilon \log(k - 1)$. Thus, we have that for all $k, \ell$,

\[ I^*(Z_\varepsilon, k, \ell) \ge \varepsilon \log \min\{k - 1, \ell - 1\}. \]

This implies that the limit of $M^N(Z_\varepsilon)$ along the path $P$ on which $N(k, \ell) = o(\log \min\{k, \ell\})$ is $\infty$,
and so the distance between $M^N(Z)$ and $M^N(Z_\varepsilon)$ in the supremum norm is infinite. □
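
The lower bound in this proof can be verified directly, because the grid it describes induces a distribution with one heavy cell and $k - 1$ equal diagonal cells. The sketch below computes the mutual information of that induced distribution; the helper names and the chosen values of `eps` and `k` are illustrative assumptions.

```python
import numpy as np

def mutual_information(joint: np.ndarray) -> float:
    """Mutual information (nats) of a discrete joint distribution."""
    px = joint.sum(axis=0, keepdims=True)
    py = joint.sum(axis=1, keepdims=True)
    nz = joint > 0
    return float(np.sum(joint[nz] * np.log(joint[nz] / (py @ px)[nz])))

def z_eps_grid(eps: float, k: int) -> np.ndarray:
    """Cell masses of Z_eps under the k-by-k grid used in the proof."""
    joint = np.zeros((k, k))
    joint[0, 0] = 1.0 - eps            # all of [0, 1/2]^2 falls in a single cell
    for i in range(1, k):
        joint[i, i] = eps / (k - 1)    # the diagonal segment, split evenly
    return joint

eps = 0.01
results = []
for k in (3, 10, 100, 1000):
    mi = mutual_information(z_eps_grid(eps, k))
    results.append((k, mi))
    # Columns: k, achieved MI, the lower bound eps*log(k-1), and MI normalized by log k.
    print(k, mi, eps * np.log(k - 1), mi / np.log(k))
```

For this grid the mutual information equals $H_b(\varepsilon) + \varepsilon \log(k - 1)$, so it grows like $\varepsilon \log k$: dividing by $\log k$ keeps it bounded, whereas dividing by anything of smaller order lets it diverge, which is the content of the proposition.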

D Proof of Theorem 4

Theorem Let $M$ be a population characteristic matrix. Then $M_{k,\uparrow}$ equals

\[ \max_{P \in P(k)} \frac{I(X, Y|_P)}{\log k} \]

where $P(k)$ denotes the set of all partitions of size at most $k$.


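Proof. Define

\[ M^*_{k,\uparrow} = \max_{P \in P(k)} \frac{I(X, Y|_P)}{\log k}. \]

We wish to show that $M^*_{k,\uparrow}$ is in fact equal to $M_{k,\uparrow}$. To show that $M_{k,\uparrow} \le M^*_{k,\uparrow}$, we observe
that for every $k$-by-$\ell$ grid $G = (P, Q)$, where $P$ is a partition into rows and $Q$ is a partition into
columns, the data processing inequality gives $I((X,Y)|_G) \le I(X, Y|_P)$. Thus $M_{k,\ell} \le M^*_{k,\uparrow}$ for
$\ell \ge k$, implying that

\[ M_{k,\uparrow} = \lim_{\ell \to \infty} M_{k,\ell} \le M^*_{k,\uparrow}. \]

It remains to show that $M^*_{k,\uparrow} \le M_{k,\uparrow}$. To do this, we let $P$ be any partition into $k$ rows, and
we define $Q_\ell$ to be an equipartition into $\ell$ columns. We let

\[ M_{k,\ell,P} = \frac{I(X|_{Q_\ell}, Y|_P)}{\log k}. \]

Since $M_{k,\ell,P} \le M_{k,\ell}$ when $\ell \ge k$, we have that for all $P$

\[ \frac{I(X, Y|_P)}{\log k} = \lim_{\ell \to \infty} M_{k,\ell,P} \le \lim_{\ell \to \infty} M_{k,\ell} = M_{k,\uparrow} \]

which gives that

\[ M^*_{k,\uparrow} = \sup_P \frac{I(X, Y|_P)}{\log k} \le M_{k,\uparrow} \]

as desired. □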
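
The first half of this proof rests on the data processing inequality: coarsening the column partition cannot increase mutual information. The sketch below checks this on a random discrete distribution, assuming for illustration that 12 fine columns stand in for the unpartitioned X axis and that the 4 rows play the role of the fixed partition $P$.

```python
import numpy as np

def mutual_information(joint: np.ndarray) -> float:
    """Mutual information (nats) of a discrete joint distribution."""
    px = joint.sum(axis=0, keepdims=True)
    py = joint.sum(axis=1, keepdims=True)
    nz = joint > 0
    return float(np.sum(joint[nz] * np.log(joint[nz] / (py @ px)[nz])))

rng = np.random.default_rng(2)
# 4 rows (the fixed row partition P) by 12 fine columns.
fine = rng.dirichlet(np.ones(4 * 12)).reshape(4, 12)
# Merge adjacent columns in groups of 3 to obtain a coarser column partition Q.
coarse = fine.reshape(4, 4, 3).sum(axis=2)
mi_fine = mutual_information(fine)
mi_coarse = mutual_information(coarse)
print(f"fine columns: {mi_fine:.4f}, merged columns: {mi_coarse:.4f}")
```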
E Proof of Theorem 5

Theorem Given a random variable $(X, Y)$, $M_{k,\uparrow}$ (resp. $M_{\uparrow,\ell}$) is computable to within an
additive error of $O(k\varepsilon \log(1/(k\varepsilon))) + E$ (resp. $O(\ell\varepsilon \log(1/(\ell\varepsilon))) + E$) in time $O(kT(E)/\varepsilon)$ (resp.
$O(\ell T(E)/\varepsilon)$), where $T(E)$ is the time required to numerically compute the mutual information
of a continuous distribution to within an additive error of $E$.

Proof. Without loss of generality we prove the claim only for $M_{k,\uparrow}$. Given $0 < \varepsilon < 1$, we
would like a partition into rows $P$ of size at most $k$ such that $I(X, Y|_P)$ is maximized. We
would like to use OptimizeXAxis for this purpose, but while our search problem is continuous,
OptimizeXAxis can only perform a discrete search over sub-partitions of some master partition
$\Pi$. We therefore set $\Pi$ to be an equipartition into $1/\varepsilon$ rows and show that this gets us close
enough to achieve the desired result.

With $\Pi$ as described, OptimizeXAxis provides in time $O(kT(E)/\varepsilon)$ a partition $P_0$ into
at most $k$ rows such that $I(X, Y|_{P_0})$ is maximized, subject to $P_0 \subset \Pi$, to within an additive
error of $E$. To prove the claim then, we must show that the loss we incur by restricting to
sub-partitions of $\Pi$ costs us at most $O(k\varepsilon \log(1/(k\varepsilon)))$. In other words, we must show that

\[ I(X, Y|_P) - I(X, Y|_{P_0}) \le O(k\varepsilon \log(1/(k\varepsilon))) \]

where $P$ is an optimal partition into rows. Note that we have omitted the absolute value above,
since by the optimality of $P$, $I(X, Y|_P) \ge I(X, Y|_{P_0})$ always.

We prove the desired bound by showing that there exists some $P' \subset \Pi$ such that the mutual
information of $(X, Y|_{P'})$ is $O(k\varepsilon \log(1/(k\varepsilon)))$-close to that achieved with $(X, Y|_P)$. Since $P' \subset \Pi$
gives us that $I(X, Y|_{P_0}) \ge I(X, Y|_{P'})$, we may then conclude that $I(X, Y|_P) - I(X, Y|_{P_0})$ is
at most $O(k\varepsilon \log(1/(k\varepsilon)))$.

We construct $P'$ by simply replacing every horizontal line in $P$ with the horizontal line in $\Pi$ closest to it. Since there are at most $k - 1$ horizontal lines in $P$, and each such line is contained in a row of $\Pi$ containing $\varepsilon$ probability mass, performing this operation moves at most $(k - 1)\varepsilon$ probability mass. In other words, the statistical distance between $(X, Y|_{P'})$ and $(X, Y|_P)$ is at most $(k - 1)\varepsilon \le k\varepsilon$. Thus, for sufficiently small $\varepsilon$, Proposition 7, proven in Appendix B, can be used to show that

\[ \left| I(X, Y|_{P'}) - I(X, Y|_P) \right| \le O\!\left( k\varepsilon \log\!\left(\frac{1}{k\varepsilon}\right) + k\varepsilon \log k \right), \]

which yields the desired result. □
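The rounding step in this proof can be sketched numerically. The example below assumes, purely for illustration, that $Y$ has been mapped to quantile space (so that a cut at position $c$ has cumulative probability $c$ and the rows of $\Pi$ are intervals of length $\varepsilon$); the function names `snap_to_grid` and `mass_moved` are hypothetical.

```python
import random

def snap_to_grid(cuts, eps):
    """Move each cut (a cumulative probability in [0, 1]) to the nearest
    multiple of eps, mimicking the replacement of each line of P by the
    closest line of the equipartition Pi."""
    n_rows = round(1 / eps)
    return [min(n_rows, max(0, round(c / eps))) * eps for c in cuts]

def mass_moved(cuts, snapped):
    """Upper bound on the probability mass that changes rows when cuts move."""
    return sum(abs(c - s) for c, s in zip(cuts, snapped))

random.seed(0)
k, eps = 6, 0.01
cuts = sorted(random.random() for _ in range(k - 1))  # k - 1 lines give k rows
snapped = snap_to_grid(cuts, eps)
moved = mass_moved(cuts, snapped)
# Each line moves by at most eps / 2, so the total is at most (k - 1) * eps.
```

The bound $(k-1)\varepsilon$ used in the proof is looser than what nearest-line rounding achieves here, which is all the argument requires.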

Remark. We do not explore here the details of the numerical integration associated with the 
above theorem, since the error introduced by the numerical integration is independent of the 
algorithm being proposed. However, standard numerical integration methods can be used to 
make this error arbitrarily small with an understood complexity tradeoff (see, e.g., [34]). 

F Proof of Theorem 6


Theorem Let $(X, Y)$ be jointly distributed random variables. Then $\partial[M] = \partial M$.
Proof. Without loss of generality, we show that $[M]_{k,\uparrow} = M_{k,\uparrow}$. Fix any partition into rows $P$. If $Q_\ell$ is an equipartition into $\ell$ columns then

\[ \lim_{\ell \to \infty} I(X|_{Q_\ell}, Y|_P) = I(X, Y|_P), \]

because the continuous mutual information equals the limit of the discrete mutual information with increasingly fine partitions. (See, e.g., Chapter 8 of [22] for a proof of this.) This means that, letting $P(k)$ denote the set of all partitions of size at most $k$, we have

\[ [M]_{k,\uparrow} = \max_{P \in P(k)} \frac{I(X, Y|_P)}{\log k} = M_{k,\uparrow}, \]

where the second equality follows from Proposition 4. □


G Consistency of MIC_e in estimating MIC*


The consistency of $\mathrm{MIC}_e$ for estimating MIC* can be established using the same technical lemmas that we used to show that MIC → MIC*. Specifically, we can use Lemma A.3, which bounds the difference, for all $k$-by-$\ell$ grids $G$, between the sample quantity $I(D_n|_G)$ and the population quantity $I((X,Y)|_G)$ with high probability, where $D_n$ is a sample of size $n$ from $(X, Y)$. That lemma yields the following fact about the sample equicharacteristic matrix, whose proof is similar to that of Lemma A.4.


Lemma G.1. Let $D_n$ be a sample of size $n$ from the distribution of a pair $(X, Y)$ of jointly distributed random variables. For every $B(n) = O(n^{1-\varepsilon})$, there exists an $a > 0$ such that for sufficiently large $n$,

\[ \left| [M](D_n)_{k,\ell} - [M](X,Y)_{k,\ell} \right| \le O\!\left(\frac{1}{n^a}\right) \]

holds for all $k\ell \le B(n)$ with probability $P(n) = 1 - o(1)$, where $[M](D_n)_{k,\ell}$ is the $k,\ell$-th entry of the sample equicharacteristic matrix and $[M](X,Y)_{k,\ell}$ is the $k,\ell$-th entry of the population equicharacteristic matrix of $(X, Y)$.

In the case of MIC, we proceeded to apply abstract continuity considerations to obtain our consistency theorem (Theorem 1) from a result analogous to the above lemma. A similar argument shows us that, in the case of the equicharacteristic matrix as well, we can estimate a large class of functions of the matrix in the same way. This is stated formally in the theorem below. As before, we let $m^\infty$ be the space of infinite matrices equipped with the supremum norm, and given a matrix $A$ the projection $r_i$ zeros out all the entries $A_{k,\ell}$ for which $k\ell > i$.


Theorem Let $f : m^\infty \to \mathbb{R}$ be uniformly continuous, and assume that $f \circ r_i \to f$ pointwise. Then for every random variable $(X, Y)$, we have

\[ (f \circ r_{B(n)})([M](D_n)) \to f([M](X,Y)) \]

in probability where $D_n$ is a sample of size $n$ from the distribution of $(X, Y)$, provided $\omega(1) < B(n) \le O(n^{1-\varepsilon})$ for some $\varepsilon > 0$.
H The EquicharClump algorithm


In Theorem 9, we sketched an algorithm called EquicharClump for approximating the sample equicharacteristic matrix that is more efficient than the naive computation. In this appendix, we describe the algorithm in detail, bound its runtime, and show that it indeed yields a consistent estimator of MIC* from finite samples as well as a consistent independence test when used to compute the total information coefficient. We then present some empirical results characterizing the sensitivity of the algorithm to its speed-versus-optimality parameter $c$.

The results in this section can be summarized as follows: let $(X, Y)$ be a pair of jointly distributed random variables, and let $D_n$ be a sample of size $n$ from the distribution of $(X, Y)$. For every $c \ge 1$, there exists a matrix $\{M\}^c(D_n)$ such that

1. There exists an algorithm EquicharClump for computing $r_B(\{M\}^c(D_n))$ in time $O(n + B^{5/2})$, which equals $O(n + n^{5(1-\varepsilon)/2})$ when $B(n) = O(n^{1-\varepsilon})$.

2. The function

\[ \mathrm{MIC}_{e,B}(\cdot) = \max_{k\ell \le B(n)} \{M\}^c(\cdot)_{k,\ell} \]

is a consistent estimator of MIC* provided $\omega(1) < B(n) \le O(n^{1-\varepsilon})$ for some $\varepsilon > 0$.

3. The function

\[ \mathrm{TIC}_{e,B}(\cdot) = \sum_{k\ell \le B(n)} \{M\}^c(\cdot)_{k,\ell} \]

yields a consistent right-tailed test of independence provided $\omega(1) < B(n) \le O(n^{1-\varepsilon})$ for some $\varepsilon > 0$.

We will prove these results in order.


H.1 Algorithm description and analysis of runtime

We begin by describing the algorithm and bounding its runtime simultaneously. As in the proof of Theorem 8, we bound the runtime required to approximately compute only the $k,\ell$-th entries of $\{M\}^c(D_n)$ satisfying $k \le \ell$, $k\ell \le B$. To do this, we analyze two portions of $\{M\}^c(D_n)$ separately: we first consider the case $\ell \ge \sqrt{B}$, in which we must compute the entries corresponding to all the pairs $\{(2,\ell), \ldots, (B/\ell,\ell)\}$. We then consider $\ell < \sqrt{B}$, in which case we need only compute the entries $\{(2,\ell), \ldots, (\ell,\ell)\}$ since the additional pairs would all have $k > \ell$.

For the case of $\ell \ge \sqrt{B}$, as in the previous theorem we can simultaneously compute using OptimizeXAxis the entries corresponding to all the pairs $\{(2,\ell), \ldots, (B/\ell,\ell)\}$ in time $O(|\Pi|^2 (B/\ell)\ell) = O(|\Pi|^2 B)$, which equals $O(c^2 B^3/\ell^2)$ when we set $\Pi$ to be an equipartition of size $cB/\ell$. Doing this for $\ell = \sqrt{B}, \ldots, B/2$ gives a contribution of the following order to the runtime.

\[ O(c^2 B^3) \sum_{\ell = \sqrt{B}}^{B/2} \frac{1}{\ell^2} = O(c^2 B^3)\, O\!\left(\frac{1}{\sqrt{B}}\right) = O(c^2 B^{5/2}) \]

For the case of $\ell < \sqrt{B}$, we can simultaneously compute using OptimizeXAxis the entries corresponding to all the pairs $\{(2,\ell), \ldots, (\ell,\ell)\}$ in time $O(|\Pi|^2 \ell^2)$, which equals $O(c^2 \ell^4) \le O(c^2 B^2)$ when we set $\Pi$ to be an equipartition of size $c\ell$. Summing over the $O(\sqrt{B})$ possible values of $\ell$ with $\ell < \sqrt{B}$ gives an upper bound of $O(c^2 B^{5/2})$.
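The two sums in this runtime analysis can be checked numerically. The sketch below is a cost model only, with all constant factors set to 1 as an assumption; it does not implement EquicharClump, and the names `clump_cost` and `ratio_to_bound` are hypothetical.

```python
import math

def clump_cost(B, c):
    """Sum the per-column cost bounds from the two cases, ignoring constants."""
    root = math.isqrt(B)
    # Columns with l >= sqrt(B): master partition of size c*B/l, cost ~ |Pi|^2 * B.
    wide = sum((c * B / l) ** 2 * B for l in range(max(root, 2), B // 2 + 1))
    # Columns with l < sqrt(B): master partition of size c*l, cost ~ |Pi|^2 * l^2.
    narrow = sum((c * l) ** 2 * l ** 2 for l in range(2, root))
    return wide + narrow

def ratio_to_bound(B, c):
    """Ratio of the summed cost to c^2 * B^(5/2)."""
    return clump_cost(B, c) / (c ** 2 * B ** 2.5)

# If the O(c^2 B^(5/2)) bound is right, these ratios stay bounded as B grows.
ratios = [ratio_to_bound(B, c=2) for B in (100, 1_000, 10_000)]
```

Under this model the ratio stays close to a constant across two orders of magnitude in $B$, consistent with the stated order of growth.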


H.2 Consistency 


Let $(X, Y)$ be a pair of jointly distributed random variables. For a sample $D_n$ of size $n$ from the distribution of $(X, Y)$ and a speed-versus-optimality parameter $c \ge 1$, let $\{M\}^c(D_n)$ denote the matrix computed by EquicharClump. (Notice the use of curly braces to differentiate this from the sample equicharacteristic matrix $[M]$.) We show here that $\max_{k\ell \le B(n)} \{M\}^c(D_n)_{k,\ell}$ is a consistent estimator of MIC*$(X, Y)$, and correspondingly that $\sum_{k\ell \le B(n)} \{M\}^c(D_n)_{k,\ell}$ yields a consistent independence test.

The key to both consistency results is that, though in calculating the $k,\ell$-th entry of $\{M\}^c(D_n)$ the algorithm only searches for optimal partitions that are sub-partitions of some equipartition, the size of the equipartition used always grows as $n$, $k$, and $\ell$ grow large. Therefore, in the limit this additional restriction does not hinder the optimization. We present this argument by introducing a population object called the clumped equicharacteristic matrix. We observe that this matrix is the limit of the EquicharClump procedure as sample size grows, and then show that the supremum and partial sums of this matrix have the necessary properties.


Definition H.1. Let $(X, Y)$ be jointly distributed random variables and fix some $c \ge 1$. Let

\[ I^{\{c*\}}((X,Y), k, \ell) = \max_G I((X,Y)|_G) \]

where the maximum is over $k$-by-$\ell$ grids $G$ whose larger partition is an equipartition and whose smaller partition must be contained in an equipartition of size $c \cdot \max\{k, \ell\}$. The clumped equicharacteristic matrix of $(X, Y)$, denoted by $\{M\}^c(X, Y)$, is defined by

\[ \{M\}^c(X,Y)_{k,\ell} = \frac{I^{\{c*\}}((X,Y), k, \ell)}{\log \min\{k, \ell\}}. \]

Notice that curly braces differentiate the quantities $I^{\{c*\}}$ and $\{M\}^c$ defined above from the corresponding equicharacteristic matrix quantities $I^{[*]}$ and $[M]$.

The following two results, which we state without proof, characterize the convergence of the output of EquicharClump to the clumped equicharacteristic matrix. These lemmas can be shown using Lemma A.3, which simultaneously bounds the difference, for all $k$-by-$\ell$ grids $G$, between the sample quantity $I(D_n|_G)$ and the population quantity $I((X,Y)|_G)$ with high probability over the sample $D_n$ of size $n$ from $(X, Y)$.

Lemma H.2. Let $D_n$ be a sample of size $n$ from the distribution of a pair $(X, Y)$ of jointly distributed random variables. For every $B(n) = O(n^{1-\varepsilon})$, there exists an $a > 0$ such that for sufficiently large $n$,

\[ \left| \{M\}^c(D_n)_{k,\ell} - \{M\}^c(X,Y)_{k,\ell} \right| \le O\!\left(\frac{1}{n^a}\right) \]

holds for all $k, \ell \le \sqrt{B(n)}$ with probability $P(n) = 1 - o(1)$, where $\{M\}^c(D_n)$ denotes the matrix computed by the EquicharClump algorithm with parameter $c$ on the sample $D_n$.


Notice that the error bound provided by the above lemma holds not for $k\ell \le B(n)$ as in the analogous Lemma A.4 and Lemma G.1, but rather for the smaller region defined by $k, \ell \le \sqrt{B(n)}$. However, though we do not have uniform convergence outside the region $k, \ell \le \sqrt{B(n)}$, we do nevertheless have pointwise convergence there, as stated below.
Lemma H.3. Fix $k, \ell \ge 2$. Let $D_n$ be a sample of size $n$ from the distribution of a pair $(X, Y)$ of jointly distributed random variables. For every $B(n) > \omega(1)$, we have that

\[ \{M\}^c(D_n)_{k,\ell} \to \{M\}^c(X,Y)_{k,\ell} \]

in probability as $n$ grows, where $\{M\}^c(D_n)$ denotes the matrix computed by the EquicharClump algorithm with parameter $c$ on the sample $D_n$.


H.2.1 Consistency for estimating MIC* 

The consistency of $\{M\}^c(D_n)$ for estimating MIC* follows from the following property of the clumped equicharacteristic matrix $\{M\}^c$, for which we state a proof sketch.


Proposition 8. Let $(X, Y)$ be a pair of jointly distributed random variables. Then we have

\[ \sup \{M\}^c(X,Y) = \mathrm{MIC}_*(X, Y). \]


Proof. (Sketch) Let $\{M\}^c = \{M\}^c(X, Y)$, and let $M = M(X, Y)$ be the characteristic matrix. Fix $k$, and consider the limit $\{M\}^c_{k,\uparrow}$ of $\{M\}^c_{k,\ell}$ as $\ell$ grows. The grid chosen for the $k,\ell$-th entry when $\ell > k$ will contain an equipartition $P_\ell$ of size $\ell$ on the x-axis, and a partition $Q_\ell$ of size $k$ on the y-axis that is optimal subject to the restriction that $Q_\ell$ be contained in an equipartition of size $c\ell$. As $\ell$ grows large, the equipartition $P_\ell$ on the first axis will become finer and finer until in the limit $X|_{P_\ell} \to X$. And the partition $Q_\ell$ will be chosen from a finer and finer equipartition, so that in the limit it approaches an unconditionally optimal partition $Q$ of size $k$. The convergence of $Q_\ell$ to the optimal partition $Q$ of size $k$ can be shown to be uniform using Proposition 7. This implies that

\[ \{M\}^c_{k,\uparrow} = \lim_{\ell \to \infty} \{M\}^c_{k,\ell} = \max_{P \in P(k)} \frac{I(X, Y|_P)}{\log k} \]

where $P(k)$ denotes the set of all partitions of size at most $k$. Therefore, the boundary $\partial\{M\}^c$ of $\{M\}^c$ equals the boundary $\partial M$ of $M$. Since $\mathrm{MIC}_*(X, Y) = \sup \partial M$ (Theorem 3), this implies that

\[ \sup \{M\}^c \ge \sup \partial\{M\}^c = \sup \partial M = \mathrm{MIC}_*(X, Y). \]

On the other hand, $\{M\}^c \le M$ element-wise since the optimization for the $k,\ell$-th entry of $\{M\}^c$ is performed over a subset of the grids searched for the $k,\ell$-th entry of $M$. This means that $\sup \{M\}^c \le \sup M = \mathrm{MIC}_*(X, Y)$. □


This fact, together with the pointwise convergence of $\{M\}^c(D_n)$ to $\{M\}^c$, suffices to establish the consistency we seek via standard continuity arguments, which we give in the abstract lemma below. The lemma applies to a double-indexed sequence indexed by $i$ and $j$; in our argument, the index $i$ corresponds to position in the equicharacteristic matrix, and the index $j$ corresponds to sample size. The sequence $A$ corresponds to the output of the EquicharClump algorithm, the sequence $a$ corresponds to the clumped equicharacteristic matrix, and the sequence $B$ corresponds to the sample equicharacteristic matrix.

Lemma H.4. Let $\{A_{ij}\}_{i,j=1}^{\infty}$ and $\{B_{ij}\}_{i,j=1}^{\infty}$ be sequences of random variables, and let $\{a_i\}_{i=1}^{\infty}$ be a non-stochastic sequence. Assume that the following conditions hold.


1. $A_{ij} \le B_{ij}$ almost surely

2. For every $i$, $A_{ij} \to a_i$ in probability

3. $B'_j = \max_{i \le j} B_{ij}$ satisfies $B'_j \to \sup\{a_i\}$ in probability

Then $A'_j = \max_{i \le j} A_{ij}$ converges in probability to $\sup\{a_i\}$ as well.

Proof. Let $a = \sup\{a_i\}$. We give the proof for the case that $a < \infty$. However, it is easily adapted to the infinite case. We must show that for every $\varepsilon > 0$ and every $0 < p < 1$, there exists some $N$ such that $P(|A'_j - a| < \varepsilon) > 1 - p$ for all $j \ge N$. By the definition of $a$, we know that there exists some $k$ such that $|a_k - a| < \varepsilon/2$. Also, by the convergence of $A_{kj}$ to $a_k$, there exists some $m$ such that $P(|A_{kj} - a_k| < \varepsilon/2) > 1 - p$ for all $j \ge m$. Thus, with probability at least $1 - p$, we have

\[ |A_{kj} - a| \le |A_{kj} - a_k| + |a_k - a| < \varepsilon \]

for all $j \ge m$.

Next, we observe that since $A'_j \ge A_{kj}$ for $j \ge k$, the above inequality implies that for $j \ge \max\{m, k\}$ we have $P(A'_j > a - \varepsilon) > 1 - p$. It remains only to show that $A'_j$ doesn't get too large, but this follows from the fact that $A'_j \le B'_j$ and $B'_j \to a$ in probability. Specifically, we are guaranteed some $N \ge \max\{m, k\}$ such that $P(B'_j < a + \varepsilon) > 1 - p$ for $j \ge N$. Since $B'_j < a + \varepsilon$ implies $A'_j < a + \varepsilon$, we have that $P(|A'_j - a| < \varepsilon) > 1 - p$ for $j \ge N$, as desired. □
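The sandwich argument of Lemma H.4 can be illustrated with a toy simulation. The construction below is hypothetical and chosen only so that the three conditions of the lemma hold: $a_i = 1 - 1/i$, $A_{ij} = a_i - |Z_{ij}|/\sqrt{j}$ and $B_{ij} = a_i + |Z_{ij}|/\sqrt{j}$ with standard normal $Z_{ij}$. It is a sketch of the statement, not of the estimator itself.

```python
import random

def simulate_max_sequences(j, rng):
    """Return (A'_j, B'_j) for the toy construction with a_i = 1 - 1/i.

    A_ij = a_i - |Z_ij| / sqrt(j) and B_ij = a_i + |Z_ij| / sqrt(j), so that
    A_ij <= B_ij, A_ij -> a_i as j grows, and max_{i<=j} B_ij -> sup a_i = 1.
    """
    a_max, b_max = float("-inf"), float("-inf")
    for i in range(1, j + 1):
        a_i = 1.0 - 1.0 / i
        noise = abs(rng.gauss(0.0, 1.0)) / j ** 0.5
        a_max = max(a_max, a_i - noise)
        b_max = max(b_max, a_i + noise)
    return a_max, b_max

rng = random.Random(0)
results = {j: simulate_max_sequences(j, rng) for j in (10, 100, 10_000)}
# As j grows, A'_j is squeezed between a - eps and B'_j, and both approach 1.
```

Here $A'_j$ can never exceed $\sup\{a_i\} = 1$, and it is pushed up toward 1 because later rows have $a_i$ close to 1 and shrinking noise, which mirrors the two halves of the proof.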

Proposition 9. The function

\[ \mathrm{MIC}_{e,B}(\cdot) = \max_{k\ell \le B(n)} \{M\}^c(\cdot)_{k,\ell} \]

is a consistent estimator of MIC* provided $\omega(1) < B(n) \le O(n^{1-\varepsilon})$ for some $\varepsilon > 0$, where $\{M\}^c(\cdot)$ is the output of the EquicharClump algorithm.

Proof. Let $(X, Y)$ be a pair of jointly distributed random variables, and let $D_n$ be a sample of size $n$ from the distribution of $(X, Y)$. Let $\{(k_i, \ell_i)\}_{i=1}^{\infty} \subset \mathbb{Z}^+ \times \mathbb{Z}^+$ be a sequence of coordinates with the property that for every number $B$ there exists an index $q(B)$ such that

\[ \{(k_i, \ell_i) : i \le q(B)\} = \{(k, \ell) : k\ell \le B\}. \]

We define $B_{ij} = [M](D_j)_{k_i,\ell_i}$, i.e., $B_{ij}$ is the $k_i,\ell_i$-th entry of the sample equicharacteristic matrix evaluated on a sample of size $j$. We analogously define $A_{ij} = \{M\}^c(D_j)_{k_i,\ell_i}$, and we define $a_i = \{M\}^c(X,Y)_{k_i,\ell_i}$. We observe that by Proposition 8, $\sup a_i = \sup \{M\}^c(X, Y) = \mathrm{MIC}_*$.

It is straightforward to see that $A_{ij} \le B_{ij}$. Additionally, Lemma H.3 shows that $A_{ij} \to a_i$ in probability, and Corollary 4.2, which states that $\mathrm{MIC}_e$ is a consistent estimator of MIC*, shows that $B'_j = \max_{i \le j} B_{ij} \to \mathrm{MIC}_*(X, Y)$. In the notation of the lemma, it therefore follows that $A'_j = \max_{i \le j} A_{ij}$ converges in probability to $\mathrm{MIC}_*(X, Y)$ as well. But this means that the sub-sequence

\[ A'_{q(B(n))} = \max_{i \le q(B(n))} A_{i,q(B(n))} = \max_{k\ell \le B(n)} \{M\}^c(D_{q(B(n))})_{k,\ell} \]

converges in probability to $\mathrm{MIC}_*(X, Y)$, which implies the result since the sequence $A'_j$ is monotone. □
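The coordinate sequence used in this proof only needs the property that each set $\{(k,\ell) : k\ell \le B\}$ appears as a prefix. One way to obtain such an ordering is to sort pairs by their product, as in the sketch below. The function names are hypothetical, and restricting to $k, \ell \ge 2$ (so that the normalization $\log\min\{k,\ell\}$ is non-zero) is an assumption made here for illustration.

```python
def coordinate_sequence(max_product, start=2):
    """List grid sizes (k, l) with k, l >= start and k * l <= max_product,
    ordered by product so that every prefix is a set {(k, l): k * l <= B}."""
    pairs = [
        (k, l)
        for k in range(start, max_product // start + 1)
        for l in range(start, max_product // k + 1)
    ]
    return sorted(pairs, key=lambda kl: (kl[0] * kl[1], kl))

def q(B, sequence):
    """Index q(B): the number of leading entries whose product is at most B."""
    return sum(1 for k, l in sequence if k * l <= B)

seq = coordinate_sequence(30)
prefix = set(seq[: q(12, seq)])
# prefix now contains exactly the pairs (k, l) with k, l >= 2 and k * l <= 12.
```

Any tie-breaking rule among pairs with equal product works; the lexicographic rule above is just one convenient choice.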
H.2.2 Consistency for total information coefficient

Similarly to the consistency argument for MIC*, we begin by exhibiting the relevant property of the population clumped equicharacteristic matrix.


Proposition 10. Let $(X, Y)$ be a pair of jointly distributed random variables. If $X$ and $Y$ are statistically independent, then $\{M\}^c(X, Y) = 0$. If not, then there exists some $a > 0$ and some integer $\ell_0 \ge 2$ such that

\[ \{M\}^c(X,Y)_{k,\ell} \ge \frac{a}{\log \min\{k, \ell\}} \]

either for all $k \ge \ell \ge \ell_0$, or for all $\ell \ge k \ge \ell_0$.


Proof. (Sketch) Let $\{M\}^c = \{M\}^c(X, Y)$. Under independence, every entry of $\{M\}^c$ is zero since $I((X,Y)|_G) = 0$ for any grid $G$. For the case of dependence, the argument is identical to that given in the proof of Proposition 4. Specifically, it can be shown that there exists some index $\ell_0$, taken without loss of generality to be a column index, and some $r > 0$ such that all but finitely many of the entries in the $\ell_0$-th column are at least $r$. It can then be shown that for large $k$, the entries $(k, \ell_0), (k, \ell_0 + 1), \ldots, (k, k)$ have non-decreasing values of $I^{\{c*\}}$. This establishes the claim for $a = r \log \ell_0$. □


We now show that the above result, together with the uniform convergence of $\{M\}^c(D_n)$ to $\{M\}^c(X, Y)$, implies the consistency we seek.


Proposition 11. The function

\[ \mathrm{TIC}_{e,B}(\cdot) = \sum_{k\ell \le B(n)} \{M\}^c(\cdot)_{k,\ell} \]

yields a consistent right-tailed test of independence provided $\omega(1) < B(n) \le O(n^{1-\varepsilon})$ for some $\varepsilon > 0$, where $\{M\}^c(\cdot)$ is the output of the EquicharClump algorithm.

Proof. Let $(X, Y)$ be a pair of jointly distributed random variables, and let $D_n$ be a sample of size $n$ from the distribution of $(X, Y)$. It suffices to show consistency for any deterministic monotonic function of the statistic in question. We therefore choose to analyze $\mathrm{TIC}_{e,B}(D_n) \log(B(n))/B(n)$.

For the null hypothesis in which $X$ and $Y$ are independent, we observe that since $\{M\}^c(D_n) \le [M](D_n)$ element-wise,

\[ 0 \le \sum_{k\ell \le B(n)} \{M\}^c(D_n)_{k,\ell} \le \sum_{k\ell \le B(n)} [M](D_n)_{k,\ell} \]

as well. Moreover, the argument given in Appendix K, which shows that the right-hand sum divided by $B(n)$ converges to 0 in probability under the null hypothesis, can be adapted to show that the right-hand sum multiplied by $\log(B(n))/B(n)$ converges to 0 as well. Thus, $\mathrm{TIC}_{e,B}(D_n) \log(B(n))/B(n)$ converges to 0 in probability, as required.

For the case that $X$ and $Y$ are dependent, the proof is analogous to the argument given in Appendix K for $\mathrm{TIC}_e$. The only difference is that Lemma H.2, which guarantees the uniform convergence of $\{M\}^c(D_n)$ to $\{M\}^c(X, Y)$, applies only to the $k,\ell$-th entries for which $k, \ell \le \sqrt{B(n)}$, rather than the entries over which we are summing, which are those for which $k\ell \le B(n)$. However, since we require only a lower bound on $\mathrm{TIC}_{e,B}$ we may neglect these entries because

\[ \mathrm{TIC}_{e,B}(D_n) = \sum_{k\ell \le B(n)} \{M\}^c(D_n)_{k,\ell} \ge \sum_{k,\ell \le \sqrt{B(n)}} \{M\}^c(D_n)_{k,\ell}. \]
It can then be shown, following the argument from Appendix K, that there exists some $a > 0$ depending only on $B$ such that, with probability $1 - o(1)$,



where $\#_n = B(n)$ represents the number of pairs $(k, \ell)$ such that $k, \ell \le \sqrt{B(n)}$. To obtain the result, we note that this means that



B(n) 

and then invoke Proposition 10, which implies that for large n 



n 


H.3 Empirical characterization of the performance of EquicharClump 

The EquicharClump algorithm has a parameter $c$ that controls the fineness of the equipartition whose sub-partitions are searched over by the algorithm. To gain an empirical understanding of the effect of $c$ on performance, we computed $\mathrm{MIC}_e$ on the set of relationships described in Section 4.4 using EquicharClump with different values of $c$. For each relationship, we compared the average $\mathrm{MIC}_e$ across all 500 independent samples from that relationship with different values of $c$. We performed this analysis at sample sizes of $n = 250$ (Figure H1), $n = 500$ (Figure H2), and $n = 5{,}000$ (Figure H3).

We summarize our findings as follows.

• At low ($n = 250$) and medium ($n = 500$) sample sizes, using $c = 1$ introduces a downward bias for more complex relationships when $B(n) = n^{0.6}$ is used but not when $B(n) = n^{0.8}$ is used. This makes sense since the low sample size and low setting of $B(n)$ mean that the algorithm is searching over grids with relatively few cells, and so setting $c = 1$ hinders its ability to find good grids in this limited search space. This bias is almost entirely alleviated by setting $c \ge 2$.

• At high sample size ($n = 5{,}000$), this effect is still observable but much reduced. This makes sense since when $n$ is large, $B(n)$ is large as well, and so the number of cells allowed in the grids being searched over is already large regardless of the exponent $\alpha$ used in $B(n) = n^\alpha$. Thus, there is less need for the robustness provided by searching for an optimal grid.

I Equitability and power analyses from [3] 

Figure I1 contains a representative equitability analysis from [3]. Figure I2 contains power curves from [3] for a large set of leading methods.
Figure H1: The effect of the parameter c on the performance of EquicharClump, at n = 250. See Section H.3 for details.

[Panel title: n = 500 (α = 0.6)]

Figure H2: The effect of the parameter c on the performance of EquicharClump, at n = 500. See Section H.3 for details.

[Panel title: n = 5000 (α = 0.5)]

Figure H3: The effect of the parameter c on the performance of EquicharClump, at n = 5,000. See Section H.3 for details.

Figure I1: The equitability of measures of dependence on a set of noisy functional relationships, reproduced from [3]. [Narrower is more equitable.] The plots were constructed as in Figure 3. Mutual information, estimated using the Kraskov estimator, is represented using the squared Linfoot correlation.

Figure I2: Power of independence testing using several leading measures of dependence, on the relationships chosen by [32], at 50 noise levels with linearly increasing magnitude for each relationship and n = 500. To enable comparison of power regimes across relationships, the x-axis of each plot lists $R^2$ rather than noise magnitude.
J Equitability analysis of randomly chosen functions with additional noise model

Figure J1 contains a version of the main text Figure 4, but where noise has been added only to the dependent variable in each functional relationship, rather than to both the independent and dependent variables.


K Consistency of independence testing based on TIC_e

Here we prove Propositions 4 and 5 and then use those propositions to prove Theorem 10, which shows that $\mathrm{TIC}_e$ can be used for independence testing.


K.1 Proof of Proposition 4


Proposition Let $(X, Y)$ be a pair of jointly distributed random variables. If $X$ and $Y$ are statistically independent, then $M(X, Y) = [M](X, Y) = 0$. If not, then there exists some $a > 0$ and some integer $\ell_0 \ge 2$ such that

\[ M(X,Y)_{k,\ell},\ [M](X,Y)_{k,\ell} \ge \frac{a}{\log \min\{k, \ell\}} \]

either for all $k \ge \ell \ge \ell_0$, or for all $\ell \ge k \ge \ell_0$.


Proof. We give the proof only for $[M] = [M](X, Y)$, with the understanding that all parts of the argument are either identical or similar for $M(X, Y)$. When $X$ and $Y$ are independent, then for any grid $G$, $(X,Y)|_G$ exhibits independence as well. Therefore $I((X,Y)|_G) = 0$ for all grids $G$, and so every entry of $[M]$, being a supremum over such quantities, is 0.

For the case that $X$ and $Y$ are dependent, our strategy is to first find, without loss of generality, a column of $[M]$ almost all of whose values are bounded away from zero, and then argue that this suffices.

The dependence of $X$ and $Y$ implies that $\mathrm{MIC}_*(X, Y) > 0$. By Corollary 4.1, which states that $\sup \partial[M] = \mathrm{MIC}_*(X, Y)$, we therefore know that there is at least one non-zero element of the boundary of $[M]$, as defined in Definition 3.5. Without loss of generality, suppose that this element is $[M]_{\uparrow,\ell_0} = \lim_{k \to \infty} [M]_{k,\ell_0}$. The fact that this limit is strictly positive implies that there exists some $k_0 \ge \ell_0$ and some $r > 0$ such that $[M]_{k,\ell_0} \ge r$ for all $k \ge k_0$. That is, all but finitely many of the entries in the $\ell_0$-th column of $[M]$ are at least $r$.

We now show that the existence of such a column suffices to prove the claim. Fix some $k \ge k_0$ and note that this implies that $k \ge \ell_0$. We argue that for all $\ell$ in $\{\ell_0, \ldots, k\}$, the desired condition holds. Since $k \ge \ell_0$, the term $I^{[*]}((X,Y), k, \ell_0)$ in the definition of $[M]_{k,\ell_0}$ is a maximization over grids that have an equipartition of size $k$ on one axis and an optimal partition of size $\ell_0$ on the other. Since we allow empty rows/columns in the maximization, substituting any $\ell$ satisfying $\ell_0 \le \ell \le k$ therefore does not constrain the maximization in any way and so it cannot decrease $I^{[*]}$. In other words, for $\ell$ satisfying $\ell_0 \le \ell \le k$, we have
\[ I^{[*]}((X,Y), k, \ell) \ge I^{[*]}((X,Y), k, \ell_0). \]

Since $k \ge \ell, \ell_0$, the normalizations in the definition of $[M]_{k,\ell}$ and $[M]_{k,\ell_0}$ are $\log \ell$ and $\log \ell_0$ respectively. Therefore, we have that

\[ [M]_{k,\ell} \ge [M]_{k,\ell_0} \frac{\log \ell_0}{\log \ell} \ge \frac{r \log \ell_0}{\log \ell}, \]

where the last inequality is because $k \ge k_0$. Setting $a = r \log \ell_0$ then gives the result. □


Figure J1: Equitability of methods examined on functions randomly drawn from a Gaussian process distribution, using a different noise model. This figure is identical to Figure 4, but with noise added only to the dependent variable in each relationship. Each method is assessed as in Figure 4, with a red interval indicating the widest range of $R^2$ values corresponding to any one value of the statistic; the narrower the red interval, the higher the equitability. Sample relationships for each Gaussian process bandwidth are shown in the top right with matching colors.


K.2 Proof of Proposition 5 

Proposition Let $(X, Y)$ be a pair of jointly distributed random variables. If $X$ and $Y$ are statistically independent, then $S_B(M(X,Y)) = S_B([M](X,Y)) = 0$ for all $B > 0$. If not, then $S_B(M(X,Y))$ and $S_B([M](X,Y))$ are both $\Omega(B \log\log B)$.


Proof. We give the argument for $M = M(X, Y)$ only, but the argument holds as stated for $[M](X, Y)$ as well.

The result follows from the guarantee given by Proposition 4 above. In the case of independence, the proposition tells us that $M = 0$, which immediately gives that $S_B(M) = 0$ for all $B > 0$. For the case of dependence, the proposition implies that there is some $a > 0$ and some integer $\ell_0 \ge 2$ such that, without loss of generality, $M_{k,\ell} \ge a/\log \ell$ for all $k \ge \ell \ge \ell_0$. We convert this into a lower bound on $S_B(M)$.

The key is to write the sum one column at a time, counting how many entries in each column both satisfy $k \ge \ell \ge \ell_0$ and $k\ell \le B$. For any $\ell$ satisfying $\ell_0 \le \ell \le \sqrt{B}$, the entries $(\ell,\ell), \ldots, (B/\ell, \ell)$ meet this criterion, and there are $B/\ell - (\ell - 1)$ of them. Moreover, since the guarantee of Proposition 4 tells us that all of these entries are at least $a/\log \ell$, we can lower-bound $S_B(M)$ as follows.

\begin{align*}
S_B(M) &\ge \sum_{\ell = \ell_0}^{\sqrt{B}} \left( \frac{B}{\ell} - (\ell - 1) \right) \frac{a}{\log \ell} \\
&= aB \sum_{\ell = \ell_0}^{\sqrt{B}} \frac{1}{\ell \log \ell} - a \sum_{\ell = \ell_0}^{\sqrt{B}} \frac{\ell - 1}{\log \ell} \\
&= aB \sum_{\ell = \ell_0}^{\sqrt{B}} \frac{1}{\ell \log \ell} - O(B) \\
&= \Omega(B \log\log B)
\end{align*}

where the second-to-last equality is because $(\ell - 1)/\log \ell \le \ell$, and the last equality is because $\sum_{i = \ell_0}^{n} 1/(i \log i)$ grows like $\log\log n$. □
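The column-by-column count in this proof can be evaluated directly. The sketch below computes the lower bound $\sum_{\ell} (B/\ell - (\ell - 1))\, a/\log\ell$ for a few values of $B$, with $a = 1$ and $\ell_0 = 2$ chosen arbitrarily for illustration; the function name `column_lower_bound` is hypothetical.

```python
import math

def column_lower_bound(B, a=1.0, l0=2):
    """Column-by-column lower bound on S_B(M): each column l in [l0, sqrt(B)]
    contributes floor(B / l) - (l - 1) entries, each at least a / log(l)."""
    total = 0.0
    for l in range(l0, math.isqrt(B) + 1):
        n_entries = B // l - (l - 1)   # entries (l, l), ..., (floor(B/l), l)
        total += n_entries * a / math.log(l)
    return total

# Dividing by B isolates the slowly growing log log B factor.
per_unit = [column_lower_bound(B) / B for B in (10**3, 10**5, 10**7)]
```

Because $\log\log B$ grows very slowly, the normalized values increase only mildly across four orders of magnitude, but they do keep increasing, which is what the proposition needs.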

K.3 Proof of Theorem 10 


Theorem The statistics $\mathrm{TIC}_B$ and $\mathrm{TIC}_{e,B}$ yield consistent right-tailed tests of independence, provided $\omega(1) < B(n) \le O(n^{1-\varepsilon})$ for some $\varepsilon > 0$.

Proof. We give the proof for TIC only; however, the argument holds as stated for $\mathrm{TIC}_e$ as well.

Let $(X, Y)$ be jointly distributed random variables, and let $D_n$ be a sample of size $n$ from the distribution of $(X, Y)$. Let $M = M(X, Y)$ be the characteristic matrix of $(X, Y)$ and let $M(D_n)$ be the sample characteristic matrix. It suffices to establish the result for a deterministic monotonic function of $\mathrm{TIC}_B(D_n)$. We therefore show convergence of $\mathrm{TIC}_B(D_n)/B(n)$ to zero under the null hypothesis of independence and to $\infty$ under any alternative. Our general strategy for doing so is to translate known bounds on our error at estimating entries of $M$ into bounds on the difference between $\mathrm{TIC}_B(D_n)/B(n) = S_{B(n)}(M(D_n))/B(n)$ and $S_{B(n)}(M)/B(n)$. We then obtain the result by invoking Proposition 5, which implies that $S_{B(n)}(M)/B(n)$ is zero under the null hypothesis but grows without bound under the alternative.

We know from Lemma A.4 (Lemma G.1 for the equicharacteristic matrix) that there exists some $a > 0$ depending only on $B$ such that

\[ \left| M(D_n)_{k,\ell} - M_{k,\ell} \right| \le O\!\left(\frac{1}{n^a}\right) \]

for all $k\ell \le B(n)$ with probability $1 - o(1)$. This means that with probability $1 - o(1)$ we have

\[ \frac{1}{B(n)} \left| \mathrm{TIC}_B(D_n) - S_{B(n)}(M) \right| \le O\!\left(\frac{\#_n}{B(n)\, n^a}\right) \]

where $\#_n$ is the number of pairs $(k, \ell)$ such that $k\ell \le B(n)$. It can be shown by taking the integral of $B/x$ with respect to $x$ that $\#_n = O(B(n) \log B(n))$. Therefore, the error in the above bound is at most $O(\log B(n)/n^a) = O(1/\mathrm{poly}(n))$ for our choice of $B(n)$.

We now use Proposition 5 to show that this bound gives the desired result. Under the null hypothesis of independence, the proposition says that $S_{B(n)}(M) = 0$ always, and so since $B$ is a growing function the bound implies that $\mathrm{TIC}_B(D_n)/B(n) \to 0$ in probability. Under the alternative hypothesis in which $(X, Y)$ exhibit a dependence, the proposition implies that $S_{B(n)}(M)/B(n) \ge \omega(1)$. Since $B$ is a growing function of $n$, this means that for any $r > 0$, the probability that $S_{B(n)}(M)/B(n) > r$ goes to 1 as $n$ grows. In other words, $\mathrm{TIC}_B(D_n)/B(n) \to \infty$ in probability. □
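The count $\#_n = O(B \log B)$ used in this proof can be checked by direct enumeration. This is a small sketch; the function name `count_pairs` and the optional `start` argument (the smallest allowed grid dimension) are hypothetical.

```python
import math

def count_pairs(B, start=1):
    """Number of integer pairs (k, l) with k, l >= start and k * l <= B."""
    return sum(max(0, B // k - (start - 1)) for k in range(start, B // start + 1))

counts = {B: count_pairs(B) for B in (10, 100, 1_000, 10_000)}
# Compare each count with B * (ln B + 1), an explicit B log B envelope.
envelopes = {B: B * (math.log(B) + 1) for B in counts}
```

The count is the discrete analogue of the integral of $B/x$ mentioned in the proof, which is where the logarithmic factor comes from.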


L Information-theoretic lemmas 


Lemma L.1. Let $\Pi$ and $\Psi$ be random variables distributed over a discrete set of states $\Gamma$, and let $(\pi_i)$ and $(\psi_i)$ be their respective distributions. Let $P = f(\Pi)$ and $Q = f(\Psi)$ for some function $f$ whose image is of size $B$. Define

\[ \varepsilon_i = \frac{\psi_i - \pi_i}{\pi_i}. \]

Then for every $0 < a < 1$ there exists some $A > 0$ such that

\[ |H(Q) - H(P)| \le (\log B)\, A \sum_i |\varepsilon_i| \]

when $|\varepsilon_i| \le 1 - a$ for all $i$.

Proof. We prove the claim with entropy measured in nats. A rescaling then gives the general result.
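Let $(p_i)$ and $(q_i)$ be the distributions of $P$ and $Q$ respectively, and define

\[ e_i = \frac{q_i - p_i}{p_i} \]

analogously to $\varepsilon_i$. Before proceeding, we observe that

\[ e_i = \sum_{j \in f^{-1}(i)} \frac{\pi_j}{p_i} \varepsilon_j. \]

We now proceed with the argument. We have from [25] that

\begin{align}
|H(Q) - H(P)| &\le \left| \sum_i \left( e_i p_i (1 + \ln p_i) + \tfrac{1}{2} e_i^2 p_i + O(e_i^3) \right) \right| \tag{2} \\
&\le \left| \sum_i e_i p_i \right| + \left| \sum_i e_i p_i \ln p_i \right| + \frac{1}{2} \left| \sum_i e_i^2 p_i \right| + \left| \sum_i O(e_i^3) \right| \tag{3} \\
&= \left| \sum_i e_i p_i \ln p_i \right| + \frac{1}{2} \sum_i e_i^2 p_i + \left| \sum_i O(e_i^3) \right| \tag{4}
\end{align}

where the final equality is because $\sum_i e_i p_i = \sum_i q_i - \sum_i p_i = 0$. We proceed by bounding each of the terms in Equation 4 separately.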

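To bound the first term, we write

\[ \left| \sum_i e_i p_i \ln p_i \right| \le -\sum_i |e_i| p_i \ln p_i. \]

We then note that $-\sum_i p_i \ln p_i \le \ln B$, and since each of the summands has the same sign this means that $-p_i \ln p_i \le \ln B$. We also observe that

\[ |e_i| \le \sum_{j \in f^{-1}(i)} \frac{\pi_j}{p_i} |\varepsilon_j| \le \sum_{j \in f^{-1}(i)} |\varepsilon_j| \]

since $\pi_j/p_i \le 1$. Together, these two facts give

\[ -\sum_i |e_i| p_i \ln p_i \le (\ln B) \sum_i |e_i| \le (\ln B) \sum_j |\varepsilon_j|. \]

The second inequality is because each $e_i$ is a weighted average of a set of $\varepsilon_j$ and each $\varepsilon_j$ enters into the expression of exactly one $e_i$.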

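To bound the second term, we use the fact that $p_i \le 1$ for all $i$, and so

\[ \sum_i e_i^2 p_i \le \sum_i e_i^2. \]

We then write

\begin{align*}
\sum_i e_i^2 &= \sum_i \left( \sum_{j \in f^{-1}(i)} \frac{\pi_j}{p_i} \varepsilon_j \right)^2 \\
&\le \sum_i \sum_{j \in f^{-1}(i)} \frac{\pi_j}{p_i} \varepsilon_j^2 \\
&\le \sum_j \varepsilon_j^2 \\
&= \sum_j O(|\varepsilon_j|)
\end{align*}

where the second line is a consequence of the convexity of $f(x) = x^2$ and the third line is because the sets $f^{-1}(i)$ partition $\Gamma$.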

To bound the third term, we write

\[ \left| \sum_i O(e_i^3) \right| \le \sum_i O(|e_i|^3) \]

and then proceed as we did with the second term, using the fact that $f(x) = x^3$ is convex for $x \ge 0$. This gives

\[ \sum_i O(|e_i|^3) \le \sum_j O(|\varepsilon_j|^3) = \sum_j O(|\varepsilon_j|), \]

completing the proof. □
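The identity relating the relative errors before and after applying $f$, which this proof relies on, can be verified numerically. In the sketch below the state space, the map `f`, and the perturbation size are all hypothetical choices made for illustration.

```python
import random

def pushforward(dist, f, size):
    """Distribution of f(state) when the state is drawn from `dist`."""
    out = [0.0] * size
    for j, mass in enumerate(dist):
        out[f[j]] += mass
    return out

random.seed(1)
n_states, B = 12, 4
f = [j % B for j in range(n_states)]                 # hypothetical map onto B values
raw = [random.random() + 0.1 for _ in range(n_states)]
pi = [r / sum(raw) for r in raw]                     # distribution of Pi
bumped = [m * (1 + random.uniform(-0.2, 0.2)) for m in pi]
psi = [b / sum(bumped) for b in bumped]              # distribution of Psi
eps = [(s - m) / m for s, m in zip(psi, pi)]         # relative errors epsilon_j

p, q = pushforward(pi, f, B), pushforward(psi, f, B)
e = [(qi - p_i) / p_i for qi, p_i in zip(q, p)]      # relative errors e_i
# e_i as a pi-weighted average of the epsilon_j with f(j) = i.
e_from_eps = [
    sum(pi[j] / p[i] * eps[j] for j in range(n_states) if f[j] == i)
    for i in range(B)
]
```

Since each $e_i$ is a weighted average of the $\varepsilon_j$ in its preimage, the total absolute relative error cannot grow under $f$, which is the step that lets the proof pass from $\sum_i |e_i|$ to $\sum_j |\varepsilon_j|$.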

Lemma L.2. Let {wi} C [0,1] be a set of size n with Y2i w i — 1> an d { u i} be a set of n 
non-negative numbers satisfying m = a and Ui < Wi- Then 


n 

Y wiHb 

i= 1 



< H b (a) 


where H b is the binary entropy function. 


Proof. Consider the random variable $X$ taking values in $\{0, \ldots, n\}$ that equals 0 with probability
$1 - \sum_i w_i$ and equals $i$ with probability $w_i$ for $0 < i \le n$. Define the random variable $Y$ taking
values in $\{0, 1\}$ by

\[ P(Y = 0 \mid X = i) = \begin{cases} 0 & i = 0 \\ u_i / w_i & 0 < i \le n. \end{cases} \]

The function we wish to bound equals $H(Y \mid X) \le H(Y)$. We therefore observe that

\[ \sum_{i=1}^{n} w_i H_b\!\left(\frac{u_i}{w_i}\right) \le H(Y). \]

The result follows from the observation that

\[ P(Y = 0) = \sum_i P(X = i) \frac{u_i}{w_i} = \sum_i u_i = a. \]

□
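As a sanity check on Lemma L.2, the following sketch draws random weights and compares the two sides of the inequality. It is only an illustration under stated assumptions: entropies are measured in nats, the weights satisfy $\sum_i w_i \le 1$, and the helper names `binary_entropy` and `lemma_l2_sides` are introduced here for the example.

```python
import math
import random


def binary_entropy(t):
    """H_b(t) in nats, with the convention 0 * ln 0 = 0."""
    if t <= 0.0 or t >= 1.0:
        return 0.0
    return -t * math.log(t) - (1.0 - t) * math.log(1.0 - t)


def lemma_l2_sides(w, u):
    """Return (sum_i w_i H_b(u_i / w_i), H_b(sum_i u_i))."""
    lhs = sum(wi * binary_entropy(ui / wi) for wi, ui in zip(w, u) if wi > 0.0)
    rhs = binary_entropy(sum(u))
    return lhs, rhs


random.seed(1)
worst_gap = float("inf")  # smallest observed value of rhs - lhs
for _ in range(1000):
    n = random.randint(1, 8)
    raw = [random.random() for _ in range(n)]
    scale = random.random() / sum(raw)       # makes sum(w) lie in [0, 1)
    w = [r * scale for r in raw]
    u = [wi * random.random() for wi in w]   # guarantees 0 <= u_i <= w_i
    lhs, rhs = lemma_l2_sides(w, u)
    worst_gap = min(worst_gap, rhs - lhs)
```

The inequality is an instance of the concavity of $H_b$, so `worst_gap` should never be meaningfully negative; a small tolerance is needed only for floating-point rounding.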
Lemma L.3. Let $X$ be a random variable distributed over $k$ states, with $P(X = x) = p_x$.
Let $\alpha_x \ge 0$ be such that $\sum_x \alpha_x = \delta$, and define the random variable $X'$ by $P(X' = x) =
(p_x + \alpha_x)/(1 + \delta)$. We have

\[ |H(X') - H(X)| \le H_b(\delta) + \delta \log k \]

where $H_b$ is the binary entropy function.


Proof. Define a new random variable $Z$ by

\[ P(Z = 0 \mid X' = x) = \frac{p_x}{p_x + \alpha_x}, \qquad P(Z = 1 \mid X' = x) = \frac{\alpha_x}{p_x + \alpha_x}. \]

We will use the fact that $H(X' \mid Z = 0) = H(X)$ to obtain the required bound.

To upper bound $H(X') - H(X)$, we write

\begin{align*}
H(X') - H(X) &\le H(X', Z) - H(X) \\
&= H(Z) + P(Z = 0) H(X' \mid Z = 0) + P(Z = 1) H(X' \mid Z = 1) - H(X) \\
&\le H_b(\delta) + (1 - \delta) H(X) + \delta H(X' \mid Z = 1) - H(X) \\
&\le H_b(\delta) - \delta H(X) + \delta \log k \\
&\le H_b(\delta) + \delta \log k
\end{align*}

where in the fourth line we have used that $H(X' \mid Z = 1) \le \log k$.

To upper bound $H(X) - H(X')$, we write

\begin{align*}
H(X') + H(Z) &\ge H(X', Z) \\
&\ge P(Z = 0) H(X' \mid Z = 0) \\
&= (1 - \delta) H(X)
\end{align*}

which yields

\[ H(X') \ge (1 - \delta) H(X) - H_b(\delta) \]

since $H(Z) = H_b(\delta)$. Thus, we have

\[ H(X) - H(X') \le \delta H(X) + H_b(\delta) \le \delta \log k + H_b(\delta). \]


□ 
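The bound of Lemma L.3 can be probed numerically. The sketch below is a hedged illustration, not part of the proof: it uses natural logarithms for both $H_b$ and $\log k$, restricts the added mass to $\delta \le 1/2$ (an assumption made here so that $H_b$ is evaluated on its increasing branch), and the helper names `entropy`, `binary_entropy` and `add_mass` are hypothetical.

```python
import math
import random


def entropy(dist):
    """Shannon entropy in nats, skipping zero-probability states."""
    return -sum(px * math.log(px) for px in dist if px > 0.0)


def binary_entropy(t):
    """H_b(t) in nats, with the convention 0 * ln 0 = 0."""
    if t <= 0.0 or t >= 1.0:
        return 0.0
    return -t * math.log(t) - (1.0 - t) * math.log(1.0 - t)


def add_mass(p, alpha):
    """Distribution of X': add alpha_x >= 0 to each state, then divide by 1 + delta."""
    total = 1.0 + sum(alpha)
    return [(px + ax) / total for px, ax in zip(p, alpha)]


random.seed(2)
worst_slack = float("inf")  # smallest observed value of bound - |H(X') - H(X)|
for _ in range(2000):
    k = random.randint(2, 8)
    raw_p = [random.random() for _ in range(k)]
    p = [r / sum(raw_p) for r in raw_p]
    delta = random.uniform(0.0, 0.5)
    raw_a = [random.random() for _ in range(k)]
    alpha = [delta * r / sum(raw_a) for r in raw_a]   # alpha_x >= 0, sums to delta
    x_prime = add_mass(p, alpha)
    bound = binary_entropy(delta) + delta * math.log(k)
    worst_slack = min(worst_slack, bound - abs(entropy(x_prime) - entropy(p)))
```

In this experiment `worst_slack` records how close any sampled perturbation comes to the bound; it is expected to stay non-negative up to rounding error.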

Lemma L.4. Let $X$ be a random variable distributed over $k$ states, with $P(X = x) = p_x$.
Let $\alpha_x \le 0$ be such that $\sum_x |\alpha_x| = \delta$, and define the random variable $X'$ by $P(X' = x) =
(p_x + \alpha_x)/(1 - \delta)$. We have

\[ |H(X') - H(X)| \le H_b\!\left(\frac{\delta}{1-\delta}\right) + \frac{\delta}{1-\delta} \log k \]

where $H_b$ is the binary entropy function. In particular, when $\delta \le 1/3$ we have

\[ |H(X') - H(X)| \le H_b(2\delta) + 2\delta \log k. \]

Proof. We observe that we can get from $X'$ to $X$ by adding $\delta/(1 - \delta)$ probability mass and
rescaling. The previous lemma then gives the result. □
Lemma L.5. Let $X$ be a random variable distributed over $k$ states, with $P(X = x) = p_x$.
Let $\alpha_x$ be such that $\sum_x |\alpha_x| = \delta$, and define the random variable $X'$ by $P(X' = x) = (p_x +
\alpha_x)/(1 + \sum_x \alpha_x)$. That is, $X'$ is the result of changing the probability of state $x$ by $\alpha_x$ and then
re-normalizing to obtain a valid distribution. If $\delta \le 1/4$, we have

\[ |H(X') - H(X)| \le 2H_b(2\delta) + 3\delta \log k \]

where $H_b$ is the binary entropy function.

Proof. Let $\delta_+$ be the total magnitude of all the positive $\alpha_x$, and let $\delta_-$ be the total magnitude
of all the negative $\alpha_x$. We first add all the mass we are going to add, and apply the first of the
previous two lemmas. Then we remove all the mass we are going to remove, and apply the
second of the two previous lemmas. This yields a bound of

\begin{align*}
& H_b(\delta_+) + \delta_+ \log k + H_b\!\left(\frac{2\delta_-}{1+\delta_+}\right) + \frac{2\delta_-}{1+\delta_+} \log k \\
&\quad \le H_b(\delta_+) + \delta_+ \log k + H_b(2\delta_-) + 2\delta_- \log k \\
&\quad \le H_b(2\delta) + \delta \log k + H_b(2\delta) + 2\delta \log k \\
&\quad \le 2H_b(2\delta) + 3\delta \log k
\end{align*}

where the first inequality is because $1 \le 1 + \delta_+ \le 2$ and $2\delta_- \le 2\delta \le 1/2$, and the second
inequality is because $\delta_+ \le \delta \le 2\delta \le 1/2$. □
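A numerical spot check of Lemma L.5 follows. It is a sketch under assumptions rather than a verification of the proof: entropies are in nats, the signed changes $\alpha_x$ are sampled so that no probability becomes negative, the renormalization divides by $1 + \sum_x \alpha_x$, and the helper names (`entropy`, `binary_entropy`, `perturb`) are hypothetical.

```python
import math
import random


def entropy(dist):
    """Shannon entropy in nats, skipping zero-probability states."""
    return -sum(px * math.log(px) for px in dist if px > 0.0)


def binary_entropy(t):
    """H_b(t) in nats, with the convention 0 * ln 0 = 0."""
    if t <= 0.0 or t >= 1.0:
        return 0.0
    return -t * math.log(t) - (1.0 - t) * math.log(1.0 - t)


def perturb(p, alpha):
    """Change each p_x by alpha_x, then renormalize by 1 + sum(alpha)."""
    total = 1.0 + sum(alpha)
    # max(..., 0.0) only guards against tiny negative values from rounding.
    return [max(px + ax, 0.0) / total for px, ax in zip(p, alpha)]


random.seed(3)
worst_slack = float("inf")  # smallest observed value of bound - |H(X') - H(X)|
for _ in range(2000):
    k = random.randint(2, 8)
    raw = [random.random() + 1e-3 for _ in range(k)]
    p = [r / sum(raw) for r in raw]
    # Signed changes with |change| <= p_x, so removals never exceed the mass present.
    signed = [random.choice((-1.0, 1.0)) * random.random() * px for px in p]
    total = sum(abs(s) for s in signed)
    if total == 0.0:
        continue
    delta = random.uniform(0.0, min(0.25, total))   # total |alpha|, at most 1/4
    alpha = [s * delta / total for s in signed]     # scaling factor is at most 1
    x_prime = perturb(p, alpha)
    bound = 2.0 * binary_entropy(2.0 * delta) + 3.0 * delta * math.log(k)
    worst_slack = min(worst_slack, bound - abs(entropy(x_prime) - entropy(p)))
```

Sampling the magnitudes proportionally to $p_x$ is a design choice made here to keep $X'$ a valid distribution; any other scheme that respects $p_x + \alpha_x \ge 0$ and $\sum_x |\alpha_x| \le 1/4$ would serve equally well.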